No baseline? No benchmarks? No biggie! An experimental approach to agile chatbot development
Lessons learned bringing LLM-based products to production
What happens when you take a working chatbot that's already serving thousands of customers a day in four different languages, and try to deliver an even better experience using Large Language Models? Good question.
It's well known that evaluating and comparing LLMs is tricky. Benchmark datasets can be hard to come by, and metrics such as BLEU are imperfect. But those are largely academic concerns: How are industry data teams tackling these issues when incorporating LLMs into production projects?
In my work as a Conversational AI Engineer, I'm doing exactly that. And that’s how I ended up centre-stage at a recent data science conference, giving the (optimistically titled) talk, “No baseline? No benchmarks? No biggie!” Today’s post is a recap of that talk, featuring:
The challenges of evaluating an evolving, LLM-powered PoC against a working chatbot
How we’re using different types of testing at different stages of the PoC-to-production process
Practical pros and cons of different test types
Whether you're a data leader, product manager, or deep in the trenches building LLM-powered solutions yourself, I hope I can spare you at least some of the mistakes we made. So without further ado, let’s get into it.
Setting the Scene
My company—a large telco—already has some pretty advanced conversational AI systems for both voice and text, including a multilingual chatbot which assists thousands of customers a day with question answering, end-to-end use cases, and transfers to real agents. We’re pretty proud of it, but we know Gen AI and LLMs can help us make it better, and implement those improvements in a more scalable manner.
Our vision is a chatbot that can take an entire conversational context, plus company and customer data, to serve diverse use cases according to defined business processes. It should be built on top of a framework that allows us to craft a controlled interaction between user and system—a so-called “on rails” approach—and to add new use cases easily, to continually improve the customer experience.
What We Want vs What We Have
Sounds great, but how on earth are we going to build this? We want to be test-driven, agile, and get feedback fast, making design decisions with confidence, based on clearly defined KPIs. But how? How can we measure our progress, when we have:
No benchmarks: For any company doing something like this, the problem is, by definition, unique. No other bot exists to serve the same customers, regarding the same products and services, as ours. This means: no benchmark data to test on.
No baseline: We're also comparing our WIP bot to a working chatbot that's been iterated on for years, making any early-stage comparison pretty unfair (how do we convince stakeholders of the value of our project, when of course the old bot has a much higher automation rate?!). Our new solution also works completely differently to its predecessor, so we need a whole new way to test it.
A New Way Forward
This has certainly been tricky, but through much trial and error, we’ve figured out a test-driven approach that’s working for us. It focuses on three key processes:
Internal testing
Customer trials
“Simulations”
Let’s now check out the pros, cons, and actionable takeaways of each test type.
Internal Testing:
How it works: Currently, different parts of the bot are implemented by different domain teams: the billing team builds billing use cases like payment extensions, the admin team implements password resets, and so on. In internal testing, each team defines scenarios based on the use cases they implemented. For example, “you are customer 1234 and you want to extend your payment deadline.” The scenarios include a so-called “happy path,” which describes how a successful interaction between customer and bot could look. Team members from other teams then pretend to be the customer and try to get the tasks done using the bot. They take notes and give each scenario a rating, and finally, all teams review the scores for their scenarios, and share improvement ideas with the entire group.
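To make that concrete, here’s a minimal sketch of how such a scenario could be written down. The structure and field names are illustrative only, not our actual template:

```python
# Hypothetical structure for an internal test scenario.
# Field names are illustrative; the real template depends on your tooling.
scenario = {
    "id": "BILLING-042",                # unique ID for traceability
    "owning_team": "billing",           # team that implemented the use case
    "persona": "You are customer 1234 and you want to extend your payment deadline.",
    "happy_path": [
        "Customer asks for a payment extension",
        "Bot authenticates the customer",
        "Bot offers the available extension dates",
        "Customer picks a date and the bot confirms",
    ],
    "tester_team": "admin",             # a *different* team plays the customer
}
```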
Pros: The key benefit is that the teams have enough general domain knowledge about our company and how the bot works to be able to probe for edge cases (which is, after all, where the problems hide). Yet because the testers didn't implement the use cases themselves, they can't accidentally “cheat,” by using the same phrasing that the implementers had in mind during development. This helps us identify use cases which fall apart when presented with unusual phrasing of a customer request.
Internal testing also helped us prepare for the customer testing (up next), by revealing the potential diversity and room for confusion in even the most straightforward of customer interactions.
Cons: This type of testing is manual and time-consuming, meaning we can only test a small number of scenarios. It's also subjective and prone to misunderstanding: evaluators sometimes misread the happy path itself, and thus judge the output incorrectly.
Lessons Learned:
Alignment problems aren’t just for LLMs: Our first round of internal testing featured a simple rating scale: Bad-OK-Good. We only realised afterwards that some people had rated interactions based on how well the bot stuck to the happy path, while others judged the customer experience. For example, if the scenario was supposed to trigger certain logical steps, but the bot instead returned a high-quality RAG¹ answer, some raters would penalise the bot while others praised it. We learned that we need a rating system that captures both bot “behaviour” and bot quality. That way, if the bot is “misbehaving” but this produces a better experience, we can react: rethinking our implementation, and checking for misunderstandings or misalignments in our expectations of what customers want.
Don’t forget to define your test outputs: Our first tests also revealed a need to align on how to write useful comments, else people miss important details, or take fuzzy notes that don’t make sense afterwards. We also needed to agree on how to best preserve the chat transcripts: Some people copied the log output, which is almost too verbose to be usable, while others just took screenshots of the UI, which is impossible to search or process in any kind of automatic way later. We failed to brainstorm such issues in advance, and paid the price in hard-to-manage test outputs afterwards.
Actionable Takeaways:
Define a clear, precise evaluation scheme: We came up with an evaluation matrix featuring three metrics, the exact values that they can take, and a guideline for how to apply them. This makes it easier for testers to test, and for teams to aggregate results afterwards. The goal? Maximum learnings; minimum evaluator effort.
Have good data management: For us, this means little things like adding test case IDs that indicate the specific scenario and tester, automatically capturing the transcripts, logs, and steps called by the bot behind the scenes, and trying out the Zephyr test management tool, which provides a more structured way to define, test, and re-test different scenarios.
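To make these two takeaways a bit more concrete, here’s a minimal sketch of what an evaluation scheme and a captured test record could look like. The metric names and fields are hypothetical, not the ones we actually use:

```python
from dataclasses import dataclass

# Illustrative evaluation scheme: each metric has a fixed set of allowed
# values plus a one-line guideline for applying it.
RATING_SCHEME = {
    "behaviour":   {"values": ["off_path", "partly_on_path", "on_path"],
                    "guideline": "Did the bot follow the expected happy path?"},
    "quality":     {"values": ["bad", "ok", "good"],
                    "guideline": "Was the experience good for the customer, regardless of path?"},
    "correctness": {"values": ["wrong", "partly_correct", "correct"],
                    "guideline": "Was the information given factually correct?"},
}

@dataclass
class TestRecord:
    """One tested scenario, captured in a searchable, aggregatable form."""
    test_id: str              # e.g. "BILLING-042__tester-anna__sprint-12"
    scenario_id: str
    tester: str
    transcript: list[str]     # captured automatically, not screenshotted
    backend_steps: list[str]  # which steps the bot actually triggered behind the scenes
    ratings: dict[str, str]   # metric name -> chosen value from RATING_SCHEME
    comments: str = ""
```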
Customer Testing:
How it works: We invited a mix of customers and non-customers to our offices and had them attempt to accomplish tasks using the bot. The tasks were similar to the scenarios described earlier—such as trying to extend their payment deadline—but testers weren’t given a happy path, or told in any way what to expect. They were recorded and encouraged to speak out loud as they worked, sharing their expectations and impressions as they waited for the bot's responses.
Pros: Having a diverse mix of participants leads to varied and unexpected behaviours. The way they interacted with the bot helped us see how tolerant customers are of issues like latency (a major headache with LLMs!), and revealed customer attitudes towards the technology. For example, some testers, including young and tech-savvy ones, were surprisingly cautious and sceptical about getting things done with a bot, and said they wouldn't trust the outcome without an additional written confirmation.
Cons: Customer testing was highly time-consuming to organise and execute: participants were chatty and/or slow, so we could only test two or three scenarios each. Such a small sample size also means we have to be careful of outlier feedback: if someone vehemently hates something, that doesn't mean it's an absolute no-go.
Again, it was a challenge to figure out what the test observers should note down, and we realised too late that we should have aligned on which aspects would be most useful for drawing actionable insights.
Finally, although we told participants that our bot was a bare-bones PoC, they still complained about missing functionality they’d seen in ChatGPT and similar tools. While that’s interesting to see, we felt it distracted them from other feedback they could have given us.
Lessons Learned:
Customers are learning from LLMs… in unexpected ways: For example, customers with experience using tools like ChatGPT wrote fluently and conversationally, and expected the bot to handle it. Less experienced testers wrote in a “keyword search” style, fearing they’d confuse the bot otherwise. And some young participants who were familiar with LLMs used this keyword style on purpose, hoping that the bot would respond with similar brevity. That was a completely unexpected and creative attempt to manipulate the bot, based on an understanding that LLMs can be prompted to respond in different styles. It proves to us that our system will need to be robust to many types of interactions, perhaps adapting its behaviour to suit.
Customers don’t want to do things the way you might expect: For example, while the industry rejoices in LLMs and “conversational everything,” our test participants weren’t that excited about the prospect. In some cases, such as when presented with a choice of invoices they’d like to delay payment on, they said they’d rather use a button to select, since “it’s faster than typing.”
This was quite a reality check, and reminded us that you can't please everyone. We sometimes received completely opposing feedback from different participants for the same task. This is a challenge with building any kind of consumer product, but it's good to remember, at least for your own sanity.
Actionable Takeaways:
Design principles are invaluable: at least for an experimental project like ours. We collated our observer feedback into a set of general design principles. For example, we sometimes felt the bot stuck too closely to our business logic, missing contextual cues from our test participants that should have swayed the process. So we made it a principle that our bot should always prioritise conversational context when responding. Having this clearly stated helps guide our development, since the principle can be baked into things like future internal tests and story acceptance criteria.
Simulations:
How it works: We have an annotated dataset of historic chat interactions, which includes customer utterances, actions triggered by the existing system, the domain detected by our classifiers, and a ground truth domain label added later. Each sprint, we run those customer utterances through the latest version of our WIP chatbot, in order to test two things.
First, automation rate: how often does the new bot trigger an end-to-end use case versus a “T2A” (transfer to a call-centre agent), and how does this compare to the existing, live system’s automation rate? Second, how does the classification accuracy compare? We found a way to measure this despite the two bots functioning completely differently: although the new bot doesn’t actually do domain detection, we can map the commands it triggers back onto the domain labels used by the production bot, giving us an apples-to-apples comparison.
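A rough sketch of that mapping and the two headline metrics; the command names, domain labels, and mapping here are made up for illustration:

```python
# Hypothetical mapping from commands triggered by the new bot back onto the
# domain labels used by the production bot's classifier.
COMMAND_TO_DOMAIN = {
    "extend_payment_deadline": "billing",
    "reset_password": "admin",
    "transfer_to_agent": None,   # T2A: no domain, counts against automation
}

def evaluate_run(triggered_commands, ground_truth_domains):
    """triggered_commands: one command per test utterance, from the new bot.
    ground_truth_domains: the annotated domain label for each utterance."""
    automated = sum(1 for cmd in triggered_commands if cmd != "transfer_to_agent")
    automation_rate = automated / len(triggered_commands)

    predicted = [COMMAND_TO_DOMAIN.get(cmd) for cmd in triggered_commands]
    correct = sum(1 for pred, truth in zip(predicted, ground_truth_domains)
                  if pred == truth)
    accuracy = correct / len(triggered_commands)
    return automation_rate, accuracy
```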
For the rest of the evaluation, we split the test utterances and WIP bot responses among the domain teams, who then manually review their quality. It may sound like a tonne of work, but we’ve found ways to make things quicker and easier. For example, if the bot’s response is “fixed” (meaning it’s never rephrased by an LLM), then as soon as an evaluator marks that response as “accurate”, certain other metrics will automatically be filled. This speeds up the process, reduces decision fatigue, and helps ensure high quality and consistency from evaluators. Afterwards, we aggregate the evaluation scores, and create stories to tackle any specific issues we observed. The scores are also directly linked to KPIs in our development roadmap, enabling us to determine whether we’re satisfied with our latest changes, and to communicate progress to broader stakeholders.
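The auto-fill rule itself is simple; something along these lines, where the metric names and the “fixed response” flag are hypothetical:

```python
def autofill(evaluation: dict, response_is_fixed: bool) -> dict:
    """If a fixed (never LLM-rephrased) response is marked accurate, pre-fill
    the metrics that cannot vary for fixed responses, so evaluators skip them."""
    if response_is_fixed and evaluation.get("correctness") == "correct":
        evaluation.setdefault("tone", "good")          # fixed wording was pre-approved
        evaluation.setdefault("hallucination", "none") # nothing was generated freely
    return evaluation
```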
Pros: Our simulation approach is more scalable than the other test types. Though we still have a lot to improve on, and there’s still a manual evaluation step in the middle, we invested great effort to streamline the overall process by writing good-quality code in a production-style “pipeline” which orchestrates the different steps: running the utterances through the new bot, preparing the responses for manual evaluation, and computing the results afterwards. Simulations are also quantitative, not just qualitative. Our large-ish dataset (ca. 1,000 utterances) is sampled to reflect the typical distribution of use-case domains in production. This more realistically represents the way customers talk to us, and the problems they have.
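The stratified sampling step is straightforward; here’s a rough sketch with pandas, assuming a column holding the annotated domain label:

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, n_total: int = 1000,
                      domain_col: str = "ground_truth_domain") -> pd.DataFrame:
    """Sample utterances so each domain appears in roughly the same
    proportion as in the full set of historic production chats."""
    proportions = df[domain_col].value_counts(normalize=True)
    parts = []
    for domain, frac in proportions.items():
        n_domain = max(1, round(frac * n_total))
        parts.append(df[df[domain_col] == domain].sample(n=n_domain, random_state=42))
    # shuffle so evaluators don't see one domain in a long block
    return pd.concat(parts).sample(frac=1, random_state=42)
```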
Cons: It’s expensive, thanks to all those LLM calls and, more importantly, the annotator effort. Another issue is that there’s no ground truth for a natural language answer. That makes automated evaluation tricky, and even manual evaluation is subjective and ambiguous.
But a much bigger problem is that we can't test multi-turn utterances. We’re passing customers’ first utterances to our new bot, and unless it answers very similarly to the old bot (which it ideally won’t), the customers’ historic second utterances will no longer make sense. We could try having an LLM play the customer and chat with our new bot, but it would be expensive and not a particularly realistic test, given that our customers have different speaking styles, dialects, and problems from whatever data ChatGPT and co. have been trained on.
A knock-on effect of the first utterance problem is that we can't test things like conversation repair, which is when a customer changes their mind during a chat. So we can’t yet get a full picture of how the bot is behaving over entire conversations. There’s also a “log-in barrier,” whereby for most first utterances, the appropriate bot response is to have the customer log in. Our WIP bot typically gets this right, but it’s an easy test, which doesn’t teach us much.
Lessons Learned:
Frequent and early communication among testers is critical: Our evaluation sessions are live group efforts, where evaluators share any tricky utterance-response pairs, in order to get second opinions on how to rate them. This helps resolve ambiguities and ensure alignment. We also document the tricky cases alongside our evaluation guidelines, making future evaluations faster and more consistent. It also helps us keep track of where we’re really struggling to implement use cases in a satisfactory way.
Actionable takeaways:
A blend of different test types is key: In addition to the test types described here, we plan to try employee testing: letting colleagues from other departments try out our WIP bot with any scenarios they can think of (rather than specific scenarios like we used in our internal testing). This should provide us with a large, diverse, and more realistic set of test results, given that employees like call centre agents know exactly how customers tend to communicate with us. Gathering feedback will also be cheap and easy, using something like a Google form.
We'd also like to try some automated evaluations such as RAGAS, a suite of LLM answer-quality metrics, some of which are computed using other LLMs. Of course, we’ll have to weigh up cost versus reliability and convenience. But at least for the RAG part of the bot, we believe it’s worth a try.
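A first experiment could look something like the sketch below. It assumes the ragas 0.1-style evaluate() API (the library’s interface changes between versions, so check the current docs) and an LLM judge configured via the usual API key; the sample data is made up:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# A tiny, made-up sample; in practice these rows would come from our simulation runs.
samples = {
    "question": ["How can I extend my payment deadline?"],
    "answer":   ["You can postpone an invoice by up to 14 days in the app under Billing."],
    "contexts": [["Customers may postpone an invoice payment by up to 14 days ..."]],
}

result = evaluate(Dataset.from_dict(samples),
                  metrics=[faithfulness, answer_relevancy])
print(result)  # e.g. {'faithfulness': 0.90, 'answer_relevancy': 0.87}
```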
What’s next?
Having run multiple tests over multiple sprints now, our biggest learning is that you’ll never think of everything in advance: Customers and chatbots will always surprise you. That’s why we’ll keep on having regular test retros (“retrospectives”), looking for ways to improve our test process every time.
I’m planning more posts about this, as we continue to learn. So stay tuned for those, and in the meantime, check out my last posts in this conference recap series, where I talked about AI strategy and trends, tips for building safer, better LLM systems, and a mathematician’s predictions for how data science, LLMs and multi-modal models can help us tackle global ethical and environmental issues.
¹ RAG, short for “Retrieval Augmented Generation”, is an LLM design pattern wherein documents that are similar to a user’s query are passed along with that query to the bot, providing it with extra information which may help it answer the request.