LLMs as Judges: Practical Problems and How to Avoid Them
Concrete advice for teams building LLM-powered evaluations
My last post was all about conceptual problems with using Large Language Models to judge other LLMs. In it, I presented the "gotchas" that teams should watch out for when building LLM-powered products. Of course, the point is not to say that all LLM evaluations are bad, or that human judges are always better. There are definitely situations where an LLM judge makes sense, and today's post is for those about to dive into using them. It's a heads-up on some of the practical issues that lie ahead; more "gotchas" to be aware of as you get started. Especially for teams building their own LLM-evaluation pipelines, this ought to save you a few headaches.
Non-Determinism in the Judge LLM
This is one of my biggest concerns about LLM judges. The LLM you're trying to test is non-deterministic (that's exactly why you want to test it in the first place), but so is the judge! Perhaps you don't see that as a problem: sure, the LLM under test is powering a live system and so needs to be trustworthy and predictable, but the LLM judge is only intended to be a guide, to help you monitor and improve the live system. So a little unpredictability isn't so bad, right?
The problem is that in the real world, this "guidance" often becomes final. Why is that? For one thing, development teams are often too busy and overloaded with decisions to challenge LLM judge outputs. For another, it's just so damn tempting to build pipelines on top of automated LLM evaluations. For example, what if you could select the best prompt or model to deploy automatically, based on the aggregated results of an automated evaluation? It sounds great, and in fact, triggering deployments based on automated metrics was already common practice with classical machine learning models. For example, my old team would retrain thousands of models every day, but only deploy the new versions if their R² score beat the previous day's model. So I don't see this practice stopping now.
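To make that concrete, here's a minimal sketch of what such a metric-gated deployment check might look like. The helper names and the usage at the bottom are my own illustration, not a real pipeline:

```python
# A minimal sketch of a metric-gated deployment check: only ship the
# retrained model if it beats yesterday's model on the holdout set.
from sklearn.metrics import r2_score

def should_deploy(y_true, candidate_preds, current_preds, min_gain=0.0):
    """Return True only if the candidate model beats the deployed one on R²."""
    return r2_score(y_true, candidate_preds) > r2_score(y_true, current_preds) + min_gain

# Hypothetical usage:
# if should_deploy(y_holdout, new_model.predict(X_holdout), old_model.predict(X_holdout)):
#     deploy(new_model)
```

Swap R² for an LLM judge's aggregated score and you can see the appeal; the catch is everything that follows.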
Now, back to why this non-determinism in the judge LLM is a problem. Let's say you want your judge to rate LLM outputs on a 5-point scale. To do this properly, the judge needs to see multiple samples at once to establish what "good" versus "bad" looks like, just like teachers grading essays for a whole class. But this creates a number of issues:
First, you need long context windows to show the judge enough samples. Unfortunately, LLMs tend to forget content in the middle of very long contexts, losing exactly the subtle differences needed to establish a proper rating scale. Admittedly, model context windows are increasing, but performance drops over long contexts remain an issue. See this video, for example.
Second, because the judge is non-deterministic, the scale shifts every time you run an evaluation with a new batch of samples. This makes it impossible to benchmark progress over time or monitor production systems consistently (let alone use the scores as a deployment trigger!).
You could try giving the judge only one sample at a time, but then what does "3 out of 5" even mean without comparative context? You could define a detailed rubric in the prompt, but if your prompt is flawed, your evaluations will be, too.
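If you want to see how much this matters for your own setup, one cheap experiment is to score the same output repeatedly and look at the spread. A rough sketch, where `judge_fn` stands in for whatever judge model and prompt you're using:

```python
import statistics
from typing import Callable

def score_stability(judge_fn: Callable[[str], int], sample: str, n_runs: int = 10) -> dict:
    """Rate the same sample repeatedly and report how much the judge disagrees with itself."""
    scores = [judge_fn(sample) for _ in range(n_runs)]
    return {"scores": scores,
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores)}

# A standard deviation approaching 1 on a 5-point scale means the
# judge's ratings are barely reproducible.
```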
Other rating schemes bring their own problems: binary evaluations (like Yes/No or Good/Bad) and three-point scales (like Good/Acceptable/Bad) can lose important nuance, and they require rationales that humans have to read. This makes the process tedious, expensive, and prone to automation bias, where human evaluators simply accept whatever the judge LLM says.
Now, let's not forget that human evaluators can also be inconsistent. But at least human evaluators talk to each other. They can debrief before, during and after an evaluation session, to get a sense of how well they're aligned with one another. It's also an argument for turning LLM tasks into classification problems (as I discussed last time). That way, you can invest the effort to create a labelled benchmark dataset (using inter-annotator agreement to tackle human non-determinism and bias), and use simpler, deterministic evaluation metrics (such as Precision and Recall) after that. What you get: reusable effort, instead of repeated risk.
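As a rough illustration of that "label once, measure forever" workflow, here's what it might look like with scikit-learn. The labels below are invented for the example:

```python
from sklearn.metrics import cohen_kappa_score, precision_score, recall_score

# Step 1: while building the benchmark, check how well two annotators agree.
annotator_a = ["refund", "refund", "complaint", "other", "complaint"]
annotator_b = ["refund", "other",  "complaint", "other", "complaint"]
print("Inter-annotator agreement (Cohen's kappa):",
      cohen_kappa_score(annotator_a, annotator_b))

# Step 2: once disagreements are resolved, the agreed labels become the gold
# standard, and every future model run is scored with deterministic metrics.
gold        = ["refund", "refund", "complaint", "other", "complaint"]
model_preds = ["refund", "complaint", "complaint", "other", "complaint"]
print("Precision (macro):", precision_score(gold, model_preds, average="macro"))
print("Recall (macro):   ", recall_score(gold, model_preds, average="macro"))
```

The annotation effort happens once; the metrics stay stable no matter how many times you rerun them.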
Propagation of Prompting Errors
If your LLM is performing a classification step, then there's another issue to be aware of when testing how your bot makes decisions. Allow me to explain. Imagine your system has a prompt that guides the LLM's classification behaviour. This prompt tells it which actions to choose, targets to set, and so on, given different types of requests. Now you want to evaluate whether your model is actually following those instructions correctly. So you create an LLM judge, describe the same task to it, and have it evaluate whether your original model did a good job. But of course, any weaknesses or blind spots in your original prompt are transferred directly to your judge's prompt. You're not getting a second opinion; you're just doubling your chances of error while giving yourself a false sense of security.
Before you say, "but of course that's ludicrous, who does that?", you won't believe what I've heard at conference coffee breaks! The interesting thing is, this might work if you're using a more powerful model as your judge. The stronger model might be better able to interpret your flawed prompt than the weaker model it's judging. But most teams with access to a powerful model would rather use it in their product, not their judge, right? So most teams end up using an equally powerful model, maybe even the exact same one, for both output and evaluation. The only situation I can imagine where the production model might be weaker than the judge is if its task is simple, like classification, and you found that a weaker but faster and cheaper model does the job. But even then, beware: if the judge is from the same model family, it's still going to have some bias towards agreeing with what the original model did.
The Challenge of Choosing the Right Testing Scope
Imagine you're testing a conversational AI system, like a customer service chatbot. Say you try to condense your entire system prompt into an LLM evaluator prompt, and then ask that LLM evaluator to rate chatbot transcripts accordingly. The problem here is that the task you're asking the LLM judge to do is potentially far more complex than what the original system was tasked with. Say the original chatbot performs many small, focused tasks at different points in the conversation. Those individual tasks are probably quite straightforward. But using a judge LLM to rate the entire conversation is like trying to solve a really complex problem in a single step. It's the equivalent of telling a software developer to "solve it in a one-liner."
A better approach might be to evaluate the chain of tasks individually, such as having the LLM judge each conversational step one at a time. But this doesn't necessarily solve the problem: later conversational turns only make sense given the entire context that preceded them, and the judge LLM might still need to understand how the entire system should work in order to accurately assess a single step.
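If you do go down the per-step route, the sketch below shows one way to structure it: judge each assistant turn on its own, but hand the judge everything that came before it as context. `judge_turn` here is a hypothetical wrapper around your per-step judge prompt:

```python
def evaluate_transcript(transcript, judge_turn):
    """transcript: ordered list of (speaker, text) tuples.
    judge_turn: hypothetical judge wrapper taking (context_so_far, assistant_text)."""
    verdicts = []
    for i, (speaker, text) in enumerate(transcript):
        if speaker != "assistant":
            continue
        context_so_far = transcript[:i]  # every turn that preceded this one
        verdicts.append(judge_turn(context_so_far, text))
    return verdicts
```

This narrows each judging task, but note that the judge still sees the growing conversation history, so the scoping problem shrinks rather than disappears.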
Another solution could be to use a more powerful model as the judge. But as we just saw, it's unlikely that your situation would lend itself to that.

Bias
One of the arguments for using LLMs as judges is to escape the biases that humans might accidentally bring to the task. Even decision fatigue, which isn't a bias but can be just as problematic, can be avoided via automated evaluations. But don't forget that LLMs bring their own biases, too. Some of these are well known: sex-based bias in various NLP processes is a long-studied problem. But there are plenty more, and some are downright silly. For example:
Position bias: LLMs favour responses based on their placement, often preferring the first or last option presented (some teams tackle this by running multiple evals in different orders, as in the sketch after this list, but that adds a lot of time and cost, and piles more noise onto the non-determinism problem).
Verbosity bias: Longer answers may get higher scores, even when they're not more accurate or helpful.
Self-enhancement bias: Models favour text generated by LLMs from the same family or provider (using GPT-4 to judge GPT-3.5, for example).
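For position bias specifically, the order-swapping check mentioned above can be as simple as asking the judge to compare the same pair of responses twice, in opposite orders, and seeing whether the winner changes. A sketch, with `judge_prefers_first` as a hypothetical wrapper around a pairwise judge prompt:

```python
def position_consistent(response_a, response_b, judge_prefers_first):
    """judge_prefers_first: hypothetical wrapper returning True if the judge
    picks the response it sees first in the prompt."""
    a_first = judge_prefers_first(response_a, response_b)   # True => A wins
    b_first = judge_prefers_first(response_b, response_a)   # True => B wins
    # The verdict is order-independent only if the same underlying response
    # wins both rounds, i.e. a different "first option" wins each time.
    return a_first != b_first
```

If this check fails often, your pairwise scores are telling you more about prompt layout than about response quality.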
Such biases don't just degrade the quality of your evaluation outputs; they can make the entire evaluation process flaky. If you change the judge model, for example, you could get completely different results. Think you can avoid that by sticking with one LLM? Tell that to all the teams who just spent the last six months frantically experimenting with new models after OpenAI announced it would sunset GPT-4.
Wonât LLM Improvements Just Fix These Problems?
An obvious counter-argument to the points above is that LLMs are improving so fast, surely these kinks will be ironed out soon enough, right? Surely better prompt engineering, few-shot learning, and model calibration will solve these problems?
Maybe. But let's not forget: LLMs are amazing already, and yet these problems still exist. If LLMs could get so good at following instructions that these testing issues disappeared entirely, then we probably wouldn't need to do testing at all, and I don't think that's happening in any realistically short timeframe. After all, simple Machine Learning models are time-tested and predictable, and yet we still monitor them (and trust me, they still go wrong!).
Cost Considerations: The Saving Grace for LLM Judges?
There is one area where LLM-based evaluations could still win out: cost-effectiveness, especially for large-scale testing, and especially compared to the cost of human evaluations. After all, engineers and conversation designers are expensive (and for good reason!).
What I'd love to see is a large-scale study comparing the one-off costs of using humans to label a dataset versus the ongoing cost (in both money and energy consumption) of regularly reusing a judge LLM. Since both approaches offer their own benefits, what's the typical ROI for each? What such a study needs to consider is that hand-labelling can be quicker than you think (I know, because my last team regularly did it), and it brings lots of benefits (again, see "Using one LLM to Judge Another? Here's Five Reasons You Shouldn't" for more). Anyone got any recommended reads?
Final Thoughts
I've been pretty down on LLM evaluation tools in this and my last post. Last time it was conceptual issues; this time it was the practical problems: non-determinism, scoping challenges, prompting hurdles and bias. My goal is not to dismiss this technology altogether; it's simply to give teams a heads-up, as they start exploring the solution landscape and even building their own LLM-based evaluations, of some of the gotchas they should be aware of. And it's also a reminder that old-school data science and MLOps approaches are still there, they've proven their worth, and they have established docs and best practices to help you use them (for more on this, see "Yes you still need NLP skills in the 'Age of ChatGPT'").