Steal My Idea: Evaluating LLM Systems with Production Data at Scale
A framework for fixing the gaps in your LLM-testing strategy

In my last post [1], I described how my team and I have been testing our WIP conversational assistant, despite having no baseline or benchmarks, and despite the LLM testing landscape being relatively immature. But there's still a gap when it come…
Subscribe to Beyond the Buzzwords, with Katherine Munro to keep reading this post and get 7 days of free access to the full post archives.