Steal My Idea: Evaluating LLM Systems with Production Data at Scale
A framework for fixing the gaps in your LLM-testing strategy

In my last post [1], I described how my team and I have been testing our WIP conversational assistant, despite having no baseline or benchmarks, and despite the LLM testing landscape being relatively immature. But there's still a gap when it come…


