LLMs as Judges: Practical Problems and How to Avoid Them
Concrete advice for teams building LLM-powered evaluations
My last post was all about conceptual problems with using Large Language Models to judge other LLMs. In it, I presented the "gotchas" that teams should watch out for when building LLM-powered products. Of course, the point is not to say that all LLM evaluations are bad, or that human judges are always better. There are definitely situations where an LLM judge makes sense, and today's post is for those about to dive into using them. It's a heads-up on some of the practical issues that lie ahead; more "gotchas" to be aware of as you get started. Especially for teams building their own LLM-evaluation pipelines, this ought to save you a few headaches.
Non-Determinism in the Judge LLM
This is one of my biggest concerns about LLM judges. The LLM you're trying to test is non-deterministic (that's exactly why you want to test it in the first place), but so is the judge! Perhaps you don't see that as a problem: sure, the LLM under test is powering a live system and so needs to be trustworthy and predictable, but the LLM judge is only intended to be a guide, to help you monitor and improve the live system. So a little unpredictability isn't so bad, right?
The problem is that in the real world, this "guidance" often becomes final. Why is that? For one thing, development teams are often too busy and overloaded with decisions to challenge LLM judge outputs. For another, it's just so damn tempting to build pipelines on top of automated LLM evaluations. For example, what if you could select the best prompt or model to deploy automatically, based on the aggregated results of an automated evaluation? It sounds great, and in fact, triggering deployments based on automated metrics was already common practice with classical machine learning models. For example, my old team would retrain thousands of models every day, but only deploy the new versions if their R² score beat the previous day's model. So I don't see this practice stopping now.
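To make that concrete, here's a minimal sketch of what such a metric-gated deployment check might look like. The helper names and the usage at the bottom are my own illustration, not a real pipeline:

```python
# A minimal sketch of a metric-gated deployment check: only ship the
# retrained model if it beats yesterday's model on the holdout set.
from sklearn.metrics import r2_score

def should_deploy(y_true, candidate_preds, current_preds, min_gain=0.0):
    """Return True only if the candidate model beats the deployed one on R²."""
    return r2_score(y_true, candidate_preds) > r2_score(y_true, current_preds) + min_gain

# Hypothetical usage:
# if should_deploy(y_holdout, new_model.predict(X_holdout), old_model.predict(X_holdout)):
#     deploy(new_model)
```

Swap R² for an LLM judge's aggregated score and you can see the appeal; the catch is everything that follows.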
Now, back to why this non-determinism in the judge LLM is a problem. Let's say you want your judge to rate LLM outputs on a 5-point scale. To do this properly, the judge needs to see multiple samples at once to establish what "good" versus "bad" looks like, just like teachers grading essays for a whole class. But this creates a number of issues:
First, you need long context windows to show the judge enough samples. Unfortunately, LLMs tend to forget content in the middle of very long contexts, losing exactly the subtle differences needed to establish a proper rating scale. Admittedly, model context windows are increasing, but performance drops over long contexts remain an issue. See this video, for example.
Second, because the judge is non-deterministic, the scale shifts every time you run an evaluation with a new batch of samples. This makes it impossible to benchmark progress over time or monitor production systems consistently (let alone use the scores as a deployment trigger!).
You could try giving the judge only one sample at a time, but then what does "3 out of 5" even mean without comparative context? You could define a detailed rubric in the prompt, but if your prompt is flawed, your evaluations will be, too.
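If you want to see how much this matters for your own setup, one cheap experiment is to score the same output repeatedly and look at the spread. A rough sketch, where `judge_fn` stands in for whatever judge model and prompt you're using:

```python
import statistics
from typing import Callable

def score_stability(judge_fn: Callable[[str], int], sample: str, n_runs: int = 10) -> dict:
    """Rate the same sample repeatedly and report how much the judge disagrees with itself."""
    scores = [judge_fn(sample) for _ in range(n_runs)]
    return {"scores": scores,
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores)}

# A standard deviation approaching 1 on a 5-point scale means the
# judge's ratings are barely reproducible.
```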
Other rating schemes bring their own problems: binary evaluations (like Yes/No or Good/Bad) and three-point scales (like Good/Acceptable/Bad) can lose important nuance, and they require rationales that humans have to read. This makes the process tedious, expensive, and prone to automation bias, where human evaluators simply accept whatever the judge LLM says.
Now, let's not forget that human evaluators can also be inconsistent. But at least human evaluators talk to each other. They can debrief before, during and after an evaluation session, to get a sense of how well they're aligned with one another. It's also an argument for turning LLM tasks into classification problems (as I discussed last time). That way, you can invest the effort to create a labelled benchmark dataset (using inter-annotator agreement to tackle human non-determinism and bias), and use simpler, deterministic evaluation metrics (such as Precision and Recall) after that. What you get: reusable effort, instead of repeated risk.
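As a rough illustration of that "label once, measure forever" workflow, here's what it might look like with scikit-learn. The labels below are invented for the example:

```python
from sklearn.metrics import cohen_kappa_score, precision_score, recall_score

# Step 1: while building the benchmark, check how well two annotators agree.
annotator_a = ["refund", "refund", "complaint", "other", "complaint"]
annotator_b = ["refund", "other",  "complaint", "other", "complaint"]
print("Inter-annotator agreement (Cohen's kappa):",
      cohen_kappa_score(annotator_a, annotator_b))

# Step 2: once disagreements are resolved, the agreed labels become the gold
# standard, and every future model run is scored with deterministic metrics.
gold        = ["refund", "refund", "complaint", "other", "complaint"]
model_preds = ["refund", "complaint", "complaint", "other", "complaint"]
print("Precision (macro):", precision_score(gold, model_preds, average="macro"))
print("Recall (macro):   ", recall_score(gold, model_preds, average="macro"))
```

The annotation effort happens once; the metrics stay stable no matter how many times you rerun them.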
Propagation of Prompting Errors
If your LLM is performing a classification step, then there's another issue to be aware of when testing how your bot makes decisions. Allow me to explain. Imagine your system has a prompt that guides the LLM's classification behaviour. This prompt tells it which actions to choose, targets to set, and so on, given different types of requests. Now you want to evaluate whether your model is actually following those instructions correctly. So you create an LLM judge, describe the same task to it, and have it evaluate whether your original model did a good job. But of course, any weaknesses or blind spots in your original prompt are transferred directly to your judge's prompt. You're not getting a second opinion; you're just doubling your chances of error while giving yourself a false sense of security.
Before you say, "but of course that's ludicrous, who does that?", you won't believe what I've heard at conference coffee breaks! The interesting thing is, this might work if you're using a more powerful model as your judge. The stronger model might be better able to interpret your flawed prompt than the weaker model it's judging. But most teams with access to a powerful model would rather use it in their product, not their judge, right? So most teams end up using an equally powerful model, maybe even the exact same one, for both output and evaluation. The only situation I can imagine where the production model might be weaker than the judge is if its task is simple, like classification, and you found that a weaker but faster and cheaper model does the job. But even then, beware: if the judge is from the same model family, it's still going to have some bias towards agreeing with what the original model did.
The Challenge of Choosing the Right Testing Scope
Imagine you're testing a conversational AI system, like a customer service chatbot. Say you try to condense your entire system prompt into an LLM evaluator prompt, and then ask that LLM evaluator to rate chatbot transcripts accordingly. The problem here is that the task you're asking the LLM judge to do is potentially far more complex than what the original system was tasked with. Say the original chatbot performs many small, focused tasks at different points in the conversation. Those individual tasks are probably quite straightforward. But using a judge LLM to rate the entire conversation is like trying to solve a really complex problem in a single step. It's the equivalent of telling a software developer to "solve it in a one-liner."
A better approach might be to evaluate the chain of tasks individually, such as having the LLM judge each conversational step one at a time. But this doesn't necessarily solve the problem: later conversational turns only make sense given the entire context that preceded them, and the judge LLM might still need to understand how the entire system should work in order to accurately assess a single step.
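If you do go down the per-step route, the sketch below shows one way to structure it: judge each assistant turn on its own, but hand the judge everything that came before it as context. `judge_turn` here is a hypothetical wrapper around your per-step judge prompt:

```python
def evaluate_transcript(transcript, judge_turn):
    """transcript: ordered list of (speaker, text) tuples.
    judge_turn: hypothetical judge wrapper taking (context_so_far, assistant_text)."""
    verdicts = []
    for i, (speaker, text) in enumerate(transcript):
        if speaker != "assistant":
            continue
        context_so_far = transcript[:i]  # every turn that preceded this one
        verdicts.append(judge_turn(context_so_far, text))
    return verdicts
```

This narrows each judging task, but note that the judge still sees the growing conversation history, so the scoping problem shrinks rather than disappears.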
Another solution could be to use a more powerful model as the judge. But as we just saw, it's unlikely that your situation would lend itself to that.

Bias
One of the arguments for using LLMs as judges is to escape the biases that humans might accidentally bring to the task. Even decision fatigue, which isn't a bias but can be just as problematic, can be avoided via automated evaluations. But don't forget that LLMs bring their own biases, too. Some of these are well known: sex-based bias in various NLP processes is a long-studied problem. But there are plenty more, and some are downright silly. For example:
Position bias: LLMs favour responses based on their placement, often preferring the first or last option presented (some teams tackle this by running multiple evals in different orders, as in the sketch after this list, but that adds a lot of time and cost, and piles more noise onto the non-determinism problem).
Verbosity bias: Longer answers may get higher scores, even when they're not more accurate or helpful.
Self-enhancement bias: Models favour text generated by LLMs from the same family or provider (using GPT-4 to judge GPT-3.5, for example).
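For position bias specifically, the order-swapping check mentioned above can be as simple as asking the judge to compare the same pair of responses twice, in opposite orders, and seeing whether the winner changes. A sketch, with `judge_prefers_first` as a hypothetical wrapper around a pairwise judge prompt:

```python
def position_consistent(response_a, response_b, judge_prefers_first):
    """judge_prefers_first: hypothetical wrapper returning True if the judge
    picks the response it sees first in the prompt."""
    a_first = judge_prefers_first(response_a, response_b)   # True => A wins
    b_first = judge_prefers_first(response_b, response_a)   # True => B wins
    # The verdict is order-independent only if the same underlying response
    # wins both rounds, i.e. a different "first option" wins each time.
    return a_first != b_first
```

If this check fails often, your pairwise scores are telling you more about prompt layout than about response quality.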
Such biases don't just degrade the quality of your evaluation outputs; they can make the entire evaluation process flaky. If you change the judge model, for example, you could get completely different results. Think you can avoid that by sticking with one LLM? Tell that to all the teams who just spent the last six months frantically experimenting with new models after OpenAI announced it would sunset GPT-4.
Wonât LLM Improvements Just Fix These Problems?
An obvious counter-argument to the points above is that LLMs are improving so fast, surely these kinks will be ironed out soon enough, right? Surely better prompt engineering, few-shot learning, and model calibration will solve these problems?
Maybe. But let's not forget: LLMs are amazing already, and yet these problems still exist. If LLMs could get so good at following instructions that these testing issues disappeared entirely, then we probably wouldn't need to do testing at all, and I don't think that's happening in any realistically short timeframe. After all, simple Machine Learning models are time-tested and predictable, and yet we still monitor them (and trust me, they still go wrong!).
Cost Considerations: The Saving Grace for LLM Judges?
There is one area where LLM-based evaluations could still win out: cost-effectiveness, especially for large-scale testing, and especially compared to the cost of human evaluations. After all, engineers and conversation designers are expensive (and for good reason!).
What I'd love to see is a large-scale study comparing the one-off costs of using humans to label a dataset versus the ongoing cost (in both money and energy consumption) of regularly reusing a judge LLM. Since both approaches offer their own benefits, what's the typical ROI for each? What such a study needs to consider is that hand-labelling can be quicker than you think (I know, because my last team regularly did it), and it brings lots of benefits (again, see "Using one LLM to Judge Another? Here's Five Reasons You Shouldn't" for more). Anyone got any recommended reads?
Final Thoughts
I've been pretty down on LLM evaluation tools in this and my last post. Last time it was conceptual issues; this time it was the practical problems: non-determinism, scoping challenges, prompting hurdles and bias. My goal is not to dismiss this technology altogether; it's simply to give teams a heads-up, as they start exploring the solution landscape and even building their own LLM-based evaluations, of some of the gotchas they should be aware of. And it's also a reminder that old-school data science and MLOps approaches are still there, they've proven their worth, and they have established docs and best practices to help you use them (for more on this, see "Yes you still need NLP skills in the 'Age of ChatGPT'").