Steal My Idea: Evaluating LLM Systems with Production Data at Scale
A framework for fixing the gaps in your LLM-testing strategy

In my last post [1], I described how my team and I have been testing our WIP conversational assistant, despite having no baseline or benchmarks, and despite the LLM testing landscape being relatively immature. But there’s still a gap when it comes to testing complex LLM-based systems and conversations, and today, I’m presenting a means to address it: An automated framework for evaluating multi-turn LLM interactions using real-world data.
The idea is that, if you can properly overwrite the “state” of a conversational AI system, you can simulate its behaviour in virtually any situation. Sounds simple, but the devil is in the (implementation) details, and those will depend on your application. This post is full of ideas for how you can build (and use) this new framework yourself.
The Final Testing Gap: Multi-Turn Conversations, with Realistic Data, at Scale
Let’s start with a little context. My team are currently building a conversational assistant — a customer service chatbot — for a large telco. It’s essentially a giant prompt describing our business processes and any information needed to execute them. At runtime, the LLM predicts the appropriate process, extracts the relevant information, and passes it to our existing systems for execution.
To benchmark our development progress, we’ve so far used a mix of manual and automatic tests for end-to-end conversations, plus we’ve “simulated” our bot with datasets of individual utterances, one at a time, taken from prod. Each technique has benefits and challenges, but there’s one big problem left…
The issue is, it’s easy to build a bot and think it’s working, but that’s only because you know how to talk to it. Real customers don’t. So you need to test it with a wide variety of customer communication styles. And these need to be multi-turn dialogues, to test your bot’s ability to ask clarification questions, handle topic switches or failed logins by the customer, and generally manage whole conversations.
Unfortunately, our simulation approach so far only works for the customer’s first utterance. So the only way to observe our bot’s behaviour on diverse, complex conversations is to “test in prod” 😱. We roll out the changes to a tiny portion of customer traffic, then intensely monitor the system for errors or unexpected metric changes. But thanks to the staggered rollout, it can take forever to notice any difference in bot behaviour, and we might never observe the most serious edge cases at all!
My Proposed Solution
Summing up: we want to be able to test a multi-turn, LLM-based system at any stage of a conversation. And we want to do this automatically, at scale, using real customer data. Here’s how I propose to do it…
Most conversational AI solutions have a “state”, which includes dialogue history and contextual data like user information, results from function calls, information extracted from the dialogue, and so on. If we can overwrite this state, using realistic, historic data, we can simulate any type of situation within any multi-step process involving an LLM. You can think of it like time travelling: you dump your bot into a specific point in a specific historic interaction, and see how it reacts. Repeat this for a whole dataset of production data where those specific conditions were present, and you’ll have a good, general understanding of how the bot will handle such situations in the future.
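To make that concrete, here’s a minimal sketch of the idea. Every name in it (ConversationState, load_state, respond) is a hypothetical stand-in for whatever your own application actually uses:

```python
from dataclasses import dataclass, field

# Purely illustrative: your bot's real state will look different.
@dataclass
class ConversationState:
    history: list = field(default_factory=list)   # alternating customer/bot turns
    slots: dict = field(default_factory=dict)     # e.g. {"intent": "bill_copy"}
    context: dict = field(default_factory=dict)   # user info, function-call results, ...

def time_travel(bot, historic_state: ConversationState, next_user_utterance: str):
    """Drop the bot into a specific point of a historic conversation and observe its next step."""
    bot.load_state(historic_state)           # the hard, application-specific part
    return bot.respond(next_user_utterance)  # the reaction you want to evaluate
```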
Now, I can’t tell you exactly how to overwrite the state in your own application. But I can say that there’ll be hacky ways to approximate it, and other ways to overwrite it more realistically. Both can be helpful, so I suggest you start with the first. But the more closely your state injection represents the real workings of your bot, the more use you’ll be able to get out of this framework.
The next two examples — a quick and hacky implementation and a more rigorous one — will show you what I mean.

Implementation Example: Hacking It
Here’s what happened when I simply approximated our bot’s state for a multi-turn conversation. The question I was interested in: how good is our intent detection accuracy when the customer is unclear and our bot has to ask them to clarify their issue?
I gathered call transcripts from some recent manual E2E tests, where my colleagues were pretending to be the customers (and deliberately asking vague questions). For each conversation, I concatenated all the “customer” turns into one utterance, and fed that to the bot as the start of a new conversation. This is only an approximation of the state, because I ignored any interim bot responses, and any context data that built up between the turns. So it’s not a perfect test — it doesn’t completely reflect runtime behaviour — because some bot actions are only possible at later stages of the conversation, and I simply threw everything into the first turn. Nevertheless, the detected intents were almost identical to those from the manual tests.
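In code, the hack is tiny. A sketch, assuming transcripts stored as lists of {role, text} turns and a hypothetical bot_client whose send() starts a fresh conversation:

```python
def hacked_first_turn(bot_client, transcript: list[dict]):
    """Approximate multi-turn state by squashing all customer turns into one opening message."""
    customer_turns = [t["text"] for t in transcript if t["role"] == "customer"]
    squashed = " ".join(customer_turns)   # interim bot replies and built-up context are ignored
    return bot_client.send(squashed)      # whatever your bot returns, e.g. detected intent + reply
```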
Now that’s not too surprising, since it’s the same bot. But it showed that even super simple “context hacking” produced comparable results to running the tests by hand. And instead of taking multiple people many hours each, it was done in minutes!
Implementation Example: Doing it Right
The above example showed that, depending on what you want to know about your LLM-based application, even an approximate state injection can be helpful. But now let’s see an example where the overwriting is done properly, so that the test execution better reflects the real-time workings of the bot. Afterwards, we’ll look at the diverse testing use cases that are possible when your injection is done well.
Imagine you’re re-implementing your “old-school” customer service chatbot using LLMs. Your existing bot uses a mixture of machine learning, decision trees and rules to detect customer intents — such as re-sending a customer invoice — and trigger actions. You’re now delegating all these steps to an LLM, by describing possible customer intents and letting it predict the appropriate one (aka “few-shot prompting/classification”). You want to compare the new bot’s performance to the older, live system.
First, gather some historic chat transcripts in which:
the predicted intent was “resend invoice”
there were at least three pairs of {customer: bot} messages
all API and function calls were successful
the customer never asked to speak to a human, and answered “yes” when asked whether their issue was resolved
Take a sample of e.g. 1000 conversations, and voilà! You’ve got something like a labelled dataset, without a huge hand-labelling effort.
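That filter is easy to express in code. A sketch, assuming your chat logs are already loaded as dicts with these (hypothetical) field names:

```python
import random

def build_eval_set(transcripts: list[dict], n: int = 1000, seed: int = 42) -> list[dict]:
    """Pick historic conversations that act as 'true positives' for the resend-invoice intent."""
    candidates = [
        t for t in transcripts
        if t["predicted_intent"] == "resend_invoice"
        and len(t["turns"]) >= 6                                   # at least 3 {customer: bot} pairs
        and all(c["status"] == "ok" for c in t["api_calls"])
        and not t["asked_for_human"]
        and t["issue_resolved"] == "yes"
    ]
    random.seed(seed)
    return random.sample(candidates, min(n, len(candidates)))
```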
Now, for each conversation: send the customer’s first utterance (i.e. the first conversational “turn”) to your new bot, record the response, and repeat for all 1000 conversations. Then evaluate all the bot’s responses using an appropriate metric.
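A sketch of that first pass, reusing the hypothetical names from above:

```python
def first_turn_predictions(bot_client, eval_set: list[dict]) -> list[dict]:
    """Replay only the opening customer utterance of each historic conversation."""
    results = []
    for transcript in eval_set:
        first_utterance = transcript["turns"][0]["text"]   # the customer's first turn
        response = bot_client.send(first_utterance)        # hypothetical client call
        results.append({
            "conversation_id": transcript["id"],
            "predicted_intent": response["intent"],
            "bot_reply": response["text"],
        })
    return results
```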
Next, for each conversation, fill the bot’s conversation history with the original {user-bot-user} interaction, plus any slots that had been filled by then (e.g. “intent=bill_copy”). When you implement this context injection, make it realistic: i.e. if certain outcomes are only supposed to be possible at certain stages of a conversation, make sure you maintain that logic here. Record your new bot’s response, repeat for the entire dataset, and evaluate.

Finally, repeat this process again, incorporating yet another pair of historic {bot: customer} utterances into your injected conversational context.
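Putting the progression together, a loop over injection depths might look like this (again, all names are hypothetical; the crucial, application-specific piece is load_state):

```python
def evaluate_at_depth(bot_client, eval_set: list[dict], n_injected_pairs: int) -> float:
    """Inject the first n historic {customer: bot} pairs as state, then let the new bot
    answer the next customer utterance and check the prediction."""
    hits = 0
    for transcript in eval_set:
        cutoff = 2 * n_injected_pairs                           # turns alternate customer/bot
        bot_client.load_state(
            history=transcript["turns"][:cutoff],               # the original interaction so far
            slots=transcript["slots_at_turn"].get(cutoff, {}),  # e.g. {"intent": "bill_copy"}
        )
        response = bot_client.send(transcript["turns"][cutoff]["text"])
        hits += response["intent"] == "resend_invoice"          # proxy metric, see below
    return hits / len(eval_set)

# Depth 0 replays just the first utterance; depths 1 and 2 inject one and two historic pairs.
# scores = {d: evaluate_at_depth(bot_client, eval_set, d) for d in (0, 1, 2)}
```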
How to Use This Framework
The key idea here is that once you have a mechanism that can correctly overwrite conversation state at any stage in any multi-step process, you can explore how your bot would handle any kind of situation for which you have prod data available. For example:
How does your conversational assistant cope when customers make mistakes or abruptly switch topics?
Does it gracefully handle chitchat, or demands to talk to a human?
What happens when a function call fails or the user can’t login?
How many customer turns should you incorporate into your document retrieval query for your RAG component?
How does changing the LLM model affect your bot’s behaviour?
How did your latest prompt changes affect how often the bot predicts different values (e.g. intents) or calls various functions? Is that change expected, or alarming?
The list is virtually endless. What’s more, the framework is metric-agnostic: use whatever makes sense. For example…
Proxy classification accuracy
The second implementation example used data samples where the intent was “resend invoice” and the customer’s behaviour indicated this was the right choice. It’s basically a dataset of true positives, which means the count of times your new bot predicts “resend invoice” is a proxy for TP classification accuracy. If this metric starts sinking, there’s probably a problem.
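Because the dataset only contains conversations where “resend invoice” was (implicitly) the right call, the metric is nearly a one-liner. A sketch, reusing the results list from the earlier sketches:

```python
def proxy_tp_rate(results: list[dict], target_intent: str = "resend_invoice") -> float:
    """Share of simulated conversations where the new bot predicted the pre-filtered intent."""
    return sum(r["predicted_intent"] == target_intent for r in results) / len(results)
```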
True accuracy (with ground truth labels and deeper analytics)
It’s not always true that a happy customer means the detected intent was correct (especially since, even if a customer is routed to the wrong service team, they can usually resolve the issue anyway). Moreover, most people only give feedback if their issue could not be resolved, so you miss out on examples from silent but happy customers. Finally, the data collection method described above also ignores all the expressly unsatisfied customers. But those are the cases where something clearly went wrong, and needs testing.
This is where the effort to create even a small labelled dataset can be worth it. It’ll allow you to properly compare the classification accuracy of your new bot versus the production one, and you can generate classification reports and confusion matrices, highlighting where the LLM is mixing up different concepts. These can indicate weak spots in your prompting, which you’ll now be able to tackle.
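If you do invest in a small labelled set, scikit-learn does the heavy lifting. A minimal sketch with made-up labels:

```python
from sklearn.metrics import classification_report, confusion_matrix

# y_true: hand-labelled intents; y_pred: the new bot's predictions for the same conversations.
y_true = ["resend_invoice", "cancel_contract", "resend_invoice", "change_tariff"]
y_pred = ["resend_invoice", "resend_invoice", "resend_invoice", "change_tariff"]

labels = sorted(set(y_true) | set(y_pred))
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
print(confusion_matrix(y_true, y_pred, labels=labels))
```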
Turns to Target
A related idea (which I just made up): how many conversational turns does it typically take for your bot to decide on the best outcome for the user? Customers don’t generally love having to answer loads of disambiguation questions, and this framework could help you track how you’re doing in that regard: you just have to record the turn index where the bot made the relevant prediction (e.g. triggered an action or filled the customer intent slot).
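A sketch of how that could be summarised, assuming each simulated conversation logs a (hypothetical) decision_turn index, or None if the bot never committed:

```python
from statistics import mean, median

def turns_to_target(results: list[dict]) -> dict:
    """Summarise how many turns the bot needed to fill the intent slot or trigger an action."""
    decided = [r["decision_turn"] for r in results if r["decision_turn"] is not None]
    return {
        "decided_share": len(decided) / len(results),
        "mean_turns": mean(decided) if decided else None,
        "median_turns": median(decided) if decided else None,
    }
```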
Custom metrics using a judge LLM
Libraries like DeepEval let you prompt a judge LLM to evaluate your bot’s outputs. This can be helpful for assessing your conversational assistant’s behaviour overall, rather than on specific tasks like classification. Just be sure that, even if the judge LLM considers the entire conversation, it only actually judges the last step (i.e. the output of your new solution). Otherwise, you’ll blur the performance between two bots, and won’t learn anything.
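One way to keep the judge honest is to state that constraint explicitly in its prompt. A rough, library-free sketch (call_llm is a hypothetical prompt-in, completion-out wrapper around whichever judge model you use; tools like DeepEval package this pattern up for you):

```python
JUDGE_PROMPT = """You are evaluating a customer service bot.
The conversation below is provided ONLY as context; every message except the final bot
message came from a different, older system. Rate ONLY the final bot message for
helpfulness and correctness on a scale of 1-5. Reply with a single digit.

Conversation so far:
{history}

Final bot message to evaluate:
{last_bot_message}
"""

def judge_last_step(call_llm, history: str, last_bot_message: str) -> int:
    """call_llm: hypothetical function taking a prompt string and returning the model's reply."""
    reply = call_llm(JUDGE_PROMPT.format(history=history, last_bot_message=last_bot_message))
    return int(reply.strip()[0])   # naive parsing; real code needs more robust handling
```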
Getting this kind of judge LLM prompting right is tricky, which is just one reason I’m rather skeptical of that approach. But I’ll save that for another post…
Diverse changes over time
Literally anything you can track can be used to quickly analyse overall changes in your bot’s behaviour: the average length and token count of bot answers, the average number and semantic similarity of documents retrieved by the RAG component, how often disambiguation questions are triggered after the customer’s first utterance, the frequency of LLM timeouts… anything.
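Most of these boil down to cheap aggregates over whatever your simulation runs log; for instance (field names hypothetical):

```python
def behaviour_snapshot(results: list[dict]) -> dict:
    """A handful of cheap aggregates to compare between two versions of the bot."""
    n = len(results)
    return {
        "avg_reply_chars": sum(len(r["bot_reply"]) for r in results) / n,
        "disambiguation_rate": sum(r["asked_clarification"] for r in results) / n,
        "timeout_rate": sum(r["llm_timed_out"] for r in results) / n,
    }
```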
Performance over progressive turns
My second implementation example involved evaluating your new bot after each customer turn. You don’t actually have to do that: a well-implemented state injection mechanism will let you time-travel to any conversational situation you want to test. But evaluation over multiple turns can be useful for exploring performance in complex conversations.
For example, assuming your latest changes have improved the bot’s behaviour overall, its performance will probably look worse on tests where more of the conversation was “handled” by the older production bot. You’ll see this trend in the average {metric} values across turns. Declining metrics aren’t an issue here: they’re still useful as a benchmarking tool. For instance, if you ever deploy a change and notice performance on later turns being much worse than it typically is, you’ve probably introduced a problem.
Human expert evaluations
While we’re aiming for automation here, human evaluations can be a useful sanity check, especially if any of your metrics changed in a way you don’t like. And luckily, unlike other forms of manual evaluation, where you have to type to (or call) your chat- or voicebot yourself, the results are already sitting there, ready for you to investigate.
Gotchas: How others have approached this issue
Surprisingly, most LLM benchmarks ignore the multi-turn evaluation problem, with just a few exceptions:
Some approaches simulate conversations using another LLM. However, this is expensive, power-hungry and slow. Moreover, an off-the-shelf LLM will never talk like our customers do (unless we fine-tune it: not exactly a quick fix!). GPT was trained to write fluent prose, not to act like a shy, distracted person calling their mobile provider while on a noisy tram on the way to work. But that’s the kind of situation our bot will have to handle, so we need data which represents it.
Other approaches involve writing manual E2E stories (which I complained about already), or generating them with an LLM (which suffers the same issues as above).
You might also be wondering: why not just replay production data for the customer side of the simulation, and get live responses from the new assistant on the bot side? Unfortunately, communication breaks down when two entities ignore each other: the historic customer replies respond to what the old bot said, not to what your new bot is saying now, so the two quickly talk past each other. This could be useful for revealing how your bot deals with derailed conversations, but there are infinite ways a conversation can go wrong, and at some point you stop learning anything new from testing them all.

Conclusion
When building a conversational application with LLMs, multi-turn testing with realistic data is vital, but difficult. Writing E2E tests manually is unscalable and prone to tester bias, and generating them with an LLM is unrealistic. You could have another LLM play the user, but that only addresses the scaling problem. There are some great LLM testing tools available to help you explore your model’s behaviour while developing it, but for the post-deployment and continuous-improvement stage, you need a large-scale, automated testing framework that works with realistic, multi-turn interactions.
Today, I’ve proposed a multi-turn simulation framework which can be used for any multi-step LLM system, with any metric, and with realistic data. It can be run quickly, cheaply and automatically, even at scale. All you need* is a mechanism to overwrite the history and state of your LLM-based application. (*I know, sounds so easy, right?!)
If this post inspires you to get building, let me know how it goes!
[1]