LLMs as Judges: Practical Problems and How to Avoid Them
Concrete advice for teams building LLM-powered evaluations
My last post was all about conceptual problems with using Large Language Models to judge other LLMs. In it, I presented the "gotchas" that teams should watch out for when building LLM-powered products. Of course, the point is not to say that all LLM evaluations are bad, or that human judges are always better. There are defini…
