Using One LLM to Judge Another? Here Are Five Reasons You Shouldn’t
Concrete ‘gotchas’ for people building things with Gen AI.
As teams across the world race to build things with Gen AI and Large Language Models, countless startups are scrambling to build LLM evaluation tools to serve them. Many such tools use LLMs to judge the system’s output, be it the final results or just the LLM component’s responses. This makes sense for some stages of the product development process. But from my experience building chat- and voicebots with LLM backends, I have serious concerns about the ‘ubiquitous utility’ of LLMs as judges. In this post, I’m sounding the alarm.
My goal is not to decry LLM-based evaluation tools altogether. Instead, I hope to get you thinking about all your evaluation options, rather than just the trendy and obvious ones. It’s about considering your needs pre- and post-deployment, understanding which metrics are useful at which stages, and how you can most efficiently access them.
Everything’s easy… in the beginning
Before I go further, let me be clear: I do believe LLMs as judges can be very useful in the early stages of product development, such as when you’re trying to envision and validate a new idea, or when you’re rapidly testing and iterating towards a PoC. An LLM-powered evaluation tool can even be overkill at this stage; roleplaying your solution directly in a UI, like ChatGPT, Gemini or Claude, might be good enough. But past that point, if you can get an LLM-testing tool up and running quickly, then it can certainly help turbocharge your testing and accelerate development. For example, tools like LangSmith let you define custom prompts to evaluate LLM outputs, allowing you to probe all kinds of aspects of your work-in-progress (WIP) system’s behaviour as you build it (a minimal sketch of the idea follows the list below). Thus, they can help you:
Identify potential use cases and functionality to add
Uncover weak spots in your prompting, logic, and overall user experience
Reveal unexpected or unintended behaviours
Identify subtle characteristics in the LLM’s outputs, which human evaluators may not detect
Take on different personas, to discover issues that the evaluators might overlook.
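To make this concrete, here is a minimal sketch of what such a custom LLM-as-judge evaluator boils down to. I’m using the plain OpenAI Python client here rather than LangSmith’s own evaluator interface, and the judge model, rubric and default criterion are placeholder assumptions you’d swap for your own:

```python
# Minimal sketch of an LLM-as-judge evaluator (illustrative; not LangSmith's API).
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment;
# the model name, rubric and default criterion are placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a customer-service bot.
Rate the bot's response from 1 (poor) to 5 (excellent) on: {criterion}.
Reply with a single integer followed by a one-sentence rationale.

User message: {user_message}
Bot response: {bot_response}"""

def judge_response(user_message: str, bot_response: str, criterion: str = "helpfulness") -> str:
    """Ask a judge model to score one bot response against a custom criterion."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            criterion=criterion,
            user_message=user_message,
            bot_response=bot_response,
        )}],
    )
    return completion.choices[0].message.content

# e.g. judge_response("Where is my parcel?", "It left our warehouse yesterday.", "tone")
```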
Such automated evaluations can also be immune to the subjective nature of human judgments, and they’re likely to be cheaper than manual review by developers and subject matter experts. Sounds like a great deal, right?
Unfortunately, there are cons to all these pros. First, prompting the LLM judge in a way that produces useful insights can be as hard as prompting the model you’re actually trying to test. Second, LLM judges aren’t completely free of human biases and blind spots. After all, they were trained by humans. Third, there’s the effort of reviewing LLM judgments, which should not be underestimated. And finally, there’s the question of the appropriateness of the tool to the development stage at hand. For example, most companies working with Gen AI aren’t really discovering their problem space from scratch—instead, they’re using LLMs to enhance or overhaul existing systems — and so this early product development work has already been done. And as these teams go to production, the types of things LLMs are great at identifying, like subtle differences between runs, become less and less relevant, or simply less actionable.
The remainder of this post will detail these issues, as well as conceptual problems with the idea of using LLMs to judge other LLMs. My goal? To motivate you to look broadly — and even back to old-school data science evaluation tools and metrics — when deciding how to test your WIP LLM-powered product.
Conceptual Objections
What are we trying to learn? Don’t we know it already?
The first thing teams must ask themselves when considering using a judge LLM is: what are we trying to learn? Everybody knows LLMs are remarkably good at communication, and as your team continually builds and tests your new solution locally, you’ll also build trust in it. In my team, for example, we learned over time that changing the system prompt for our voicebot doesn’t have as catastrophic and widespread an impact as we’d initially feared.
So once you’ve ‘gotten to know’ your model, what can you learn by asking an LLM to judge its output on general qualities, like tone? When building a conversational application, for instance, what exactly do you gain by asking “was this a good conversation?” The judge LLM will probably be biased and say “yes” anyway, especially if it’s a model from the same family as the model under test (using GPT-4 to judge GPT-3.5, for example). So all this does is confirm your positive impression of what LLMs can do.
Continuing the real-world examples, a colleague recently tried using an LLM-based evaluation tool to rate our chatbot’s responses according to a custom “friendliness” metric. But we all know that LLMs are sickeningly friendly; we don’t need to run 1000 samples through our system to prove it. And besides, our developers already have a feeling for whether it’s friendly enough, since they work on it every day! We can sum this up with our first ‘gotcha’: beware of fancy tools which don’t teach you anything you don’t already know.
What’s the actionable insight? What’s the next-step value?
But let’s say you’re also building a chatbot, you liked my colleague’s friendliness metric, and you’ve implemented it yourself. You test your bot after a major update to its system prompt, and the results show that the new version is 3% less friendly than the old one. What on earth are you going to do with that information? How can you action it? All you can do is ask the bot to be nicer (also known as “prompt and pray”), but who knows if that’ll work: after all, an LLM doesn’t even know what friendliness is! It only knows how to imitate a friendly tone. So whether your bot is 90% or 93% friendly is not a helpful thing to know.
More generally, many LLM judges are designed to output a rationale about why they made some judgment. Again, that can be enlightening in your earliest development stages. But later on — especially after deploying your solution to production — reading a load of rationales can become exhausting, overwhelming and time-consuming. Such insights are also a lot harder to action than simpler, concrete, aggregate metrics: if your bot performs classification, for example, then reviewing a distribution of output labels and confusion matrices could be a lot quicker and easier (see the sketch below). And if your classification accuracy is bad, the next, value-generating step is much clearer: fix it! Thus, we get to our second gotcha: beware any tool whose insights are interesting, but not actionable.
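To illustrate that quicker, aggregate view, here is a rough sketch for a classification component using scikit-learn; the intent labels and predictions are invented for the example:

```python
# Sketch: aggregate, actionable metrics for a classification component,
# instead of piles of per-example LLM rationales. Data is invented.
from collections import Counter
from sklearn.metrics import confusion_matrix, classification_report

true_intents = ["refund", "refund", "delivery", "other", "delivery"]
predicted    = ["refund", "other",  "delivery", "other", "delivery"]

print(Counter(predicted))                              # distribution of output labels
print(confusion_matrix(true_intents, predicted))       # where the errors concentrate
print(classification_report(true_intents, predicted))  # precision/recall per intent
```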
Is an LLM-powered evaluation even necessary?
In the rush to play with all these shiny new evaluation tools, some teams forget that simpler, and often more effective, methods are available. For example, I’ve seen LLMs used to detect the language of a customer-bot interaction. But that’s a slow and expensive way to go, considering that this is a pretty well-solved problem (don’t believe me? My stupid-simple tutorial shows how easy it is to get decent results, and I’m sure you could slip such a lightweight statistical model into your evaluation stack without too much extra overhead).
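As one example of such a lightweight statistical option (not necessarily the same approach as in the tutorial), the off-the-shelf langdetect package does the job in a couple of lines:

```python
# Sketch: cheap, non-LLM language detection for an evaluation pipeline.
# Assumes `pip install langdetect`, which wraps a small statistical model.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make results deterministic across runs

print(detect("Hello, my parcel still hasn't arrived."))     # expected: 'en'
print(detect("Hallo, mein Paket ist immer noch nicht da."))  # expected: 'de'
```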
Another issue is that when it’s so easy to define any kind of custom evaluation metric you want, your creativity and resourcefulness take a backseat. Do you really need an LLM to judge the conciseness of an answer? Would a simple character or token count be enough (especially once you’re in production, and all you need is a sanity check to ensure your latest update didn’t cause your model to fall silent or start blabbing uncontrollably)? And if you’re building a RAG system, how about you concentrate on your retrieval accuracy — for which there are plenty of existing and effective methods — before asking your judge LLM to start evaluating whole conversations and giving you piles of long-form responses to review? After all, it’s pointless tweaking your generated responses if your retrieval mechanism is no good. But if it is good, you can already have more confidence in the final results, considering that the generation part is something we all know LLMs are great at.
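On the conciseness question in particular, the sanity check really can be that simple. Here’s a rough sketch, with thresholds that are arbitrary assumptions you’d tune to your own product:

```python
# Sketch: a length-based sanity check instead of an LLM "conciseness" judge.
# The thresholds are arbitrary placeholders.
MIN_CHARS, MAX_CHARS = 10, 1200

def flag_length_outliers(responses: list[str]) -> list[tuple[int, str]]:
    """Return (index, reason) pairs for responses that fell silent or started blabbing."""
    flags = []
    for i, text in enumerate(responses):
        if len(text.strip()) < MIN_CHARS:
            flags.append((i, "suspiciously short"))
        elif len(text) > MAX_CHARS:
            flags.append((i, "suspiciously long"))
    return flags

# e.g. flag_length_outliers(responses_from_latest_release)
```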
Our third gotcha, then, is to beware unnecessary LLM-powered evaluations, and always ask if a simpler method will do.
Are we forgetting to think for ourselves?
Despite all these problems, I’ve met plenty of developers who were all-in on using LLMs as judges. I suppose I’m not surprised; engineers are used to the determinism that usually comes with a unit test. Pass. Fail. That’s it. They aren’t used to having to review stack traces and make a judgment about whether the result was acceptable or not. So who could blame them for trying to automate away the most ambiguous part of building a Gen AI-powered product? Spinning up a notebook with a LangSmith evaluation example is fun and easy, and as soon as it’s done, engineers can get on with building the pipelines to send the LLM responses there at scale. But this is a missed opportunity: the developers need to think about the actionable insights such a pipeline would generate. In doing so, they’ll better understand the problem, the data, and what it’s like to work with an LLM to build a great customer experience (i.e. what the CX designers are doing). This step is crucial for developing good product intuition, and I fear that LLM evaluation tools distract us from it.
Another issue is that LLMs have been trained with reinforcement learning, which generally works well in aligning their decisions with human preferences. But this gets tricky when a judgment requires expert knowledge. Take, for example, this research comparing the evaluations and feedback given by Subject Matter Experts (SMEs) and LLMs. It found that while the humans provided detailed, unique and context-specific feedback, including new insights drawn from their own experience, the LLMs gave general explanations and even recycled information from the outputs they were judging.
But the problem here goes further than LLM-judgments being potentially inferior to expert ones. The real risk is that our natural ‘automation bias’ can cause us to more readily trust the LLM judges, particularly when they align with the developers’ gut feeling about the performance of the system under test. That is, if the developers think it’s working, and an LLM backs them up, they might be more likely to accept that judgment without question. But this can lead to a false sense of security, especially if you define your custom metrics too specifically. For example, say you use an LLM to assess your chatbot’s adherence to brand style guidelines, and apply it only to individual customer-bot conversation turns: you can get good results, thanks to the bot’s linguistic prowess, but these might mask deeper problems. That’s why simple, concrete, aggregatable metrics like customer satisfaction, number of conversation turns, and task completion rate are so important: a number is not an explanation; instead, it forces you to think through the problem for yourself, to identify possible causes and remedies.
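As a rough sketch of what tracking those concrete numbers might look like, assuming conversation logs with hypothetical fields for satisfaction score, turn count and task completion:

```python
# Sketch: simple, aggregatable health metrics from conversation logs.
# The field names (csat, turns, task_completed, bot_version) are hypothetical.
import pandas as pd

conversations = pd.DataFrame([
    {"bot_version": "v1", "csat": 4, "turns": 6,  "task_completed": True},
    {"bot_version": "v2", "csat": 3, "turns": 11, "task_completed": False},
    {"bot_version": "v2", "csat": 5, "turns": 5,  "task_completed": True},
])

summary = conversations.groupby("bot_version").agg(
    avg_csat=("csat", "mean"),
    avg_turns=("turns", "mean"),
    completion_rate=("task_completed", "mean"),
)
print(summary)  # one number per version: nothing to explain away, plenty to investigate
```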
This brings us to our fourth gotcha: beware any tool that automates away your natural curiosity, skepticism and creativity!
Summing Up
Development teams building LLM-powered products, like chatbots, voicebots, and other conversational use cases, need to go easy on using LLMs to judge them. That doesn’t mean avoiding them altogether, but rather, seeing through the hype, and thoroughly considering whether the tool is appropriate for the team’s specific needs at specific stages of the project. For example, these tools can be useful in the earliest discovery phase, but they can also be overkill. Later, as development and experimentation ramp up, they can generate valuable insights, but those come with certain ‘gotchas’. And finally, once a solution is in production, the verbosity of LLM-powered evaluations can become a burden, especially compared to simple, concrete, aggregate metrics.
Ultimately, how you evaluate your WIP LLM-powered solution is up to you. The final gotcha I can give is: beware of subjective metrics, and never forget specific ones — like customer satisfaction score — that reflect whether your solution is delivering real value to the end customer.
Liked this post? Last time I tackled another overlooked issue in evaluating LLM-powered systems: the need to evaluate multi-turn conversational interactions, with real production data, at scale. You can steal my proposed framework for solving this problem, by clicking this link.