You shipped something with an LLM. Users seem happy. But "seems happy" is not a metric, and you have no idea whether last week's prompt change made things better or worse.
This is where most teams are. They know evaluation matters. They skip it anyway. The honest reason? Evaluation is genuinely hard. Building a good eval set takes real effort, the tooling has a learning curve, and shipping feels more urgent than measuring. So teams run on vibes until something breaks badly enough to force the issue.
But the tools have caught up. LLM-as-judge has become production-ready. The methodology for building evaluation sets is now well-documented. And skipping evaluation is increasingly expensive as AI products mature and small quality regressions compound.
The Spectrum: Vibes to Production
Vibes sit at the bottom. You try some prompts, they feel good, you ship. Everyone starts here, and it's fine for prototyping. The problem is that vibes don't scale. What feels good to you might fail on edge cases you never imagined. And you have no baseline to compare against when you make changes.
Vibes are the default because they're fast. They become dangerous when you mistake them for real evaluation.
Benchmarks offer something more objective. You run your model on MMLU, HumanEval, or whatever standard test is relevant to your domain. You get a number.
The problem, as we've covered in our analysis of benchmark gaming, is that benchmark scores often don't predict production performance. A model that scores 3% higher on HumanEval might be worse for your specific use case. Benchmarks test narrow capability slices under controlled conditions. Your production environment doesn't look like a benchmark. That said, they're a rough filter. If a model tanks on HumanEval, you probably shouldn't use it for coding tasks. They just don't tell you whether it will work for your specific workflow.
Human evaluation is the gold standard because humans are ultimately what you're trying to satisfy. The problems are scale and cost. Anthropic's engineering team notes that human evaluation doesn't scale; you can't have humans rate every output in a high-throughput system. And hiring qualified evaluators for specialized domains (medical, legal, technical) is expensive and slow.
Human evaluation works best as a calibration layer: use it to build your ground truth dataset, then automate from there.
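The calibration step can be sketched in a few lines: score your ground-truth examples with the automated judge, then check how often it agrees with the human labels before trusting it. The label data and the 80% threshold below are illustrative assumptions, not figures from any cited research.

```python
# Minimal sketch: calibrate an automated judge against human labels.
# Labels and threshold are illustrative assumptions.

def agreement_rate(human_labels, judge_labels):
    """Fraction of examples where judge and human agree exactly."""
    assert len(human_labels) == len(judge_labels)
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# In practice you'd start with ~30 human-labeled examples.
human = ["pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "fail", "pass", "fail"]

rate = agreement_rate(human, judge)
print(f"judge-human agreement: {rate:.0%}")
if rate < 0.8:
    print("Judge disagrees too often; recalibrate before automating.")
```

If agreement is low, fix the judge prompt or rubric and re-run; only automate once the judge tracks your human labels.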
LLM-as-Judge Has Grown Up
This is where the field has matured significantly. You use one language model to evaluate the outputs of another.
The counterintuitive finding:
GPT-4 achieved 85% agreement with human experts on MT-Bench, exceeding human-human agreement of 81%.
LLM judges can be more consistent than humans for certain tasks. But they have documented failure modes. The same research found significant biases: position bias (Claude showed 70% first-position preference), verbosity bias (over 90% preference for longer responses), and self-enhancement bias (GPT-4 favored its own outputs by 10%). Hallucination detection is particularly weak; the best models achieved only 58.5% accuracy on factual versus hallucinated summaries.
The key to making LLM-as-judge work is methodology. Hugging Face's research found that improved prompt design increased human-judge correlation from 0.567 to 0.843, a gain of nearly 0.3. The specific recommendations: use small integer scales (1-4) instead of floats, add a chain-of-thought "Evaluation" field before the final score, provide clear rubrics for what each score means, and start with roughly 30 human-labeled examples to calibrate. Confident AI's research adds that few-shot prompting increased GPT-4's consistency from 65% to 77.5%, and that swapping output positions helps detect position bias.
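A judge prompt following those recommendations might look like the sketch below: a 1-4 integer scale with a rubric, and an "Evaluation" reasoning field before the score. The exact wording and the parsing helper are assumptions for illustration, not a published prompt.

```python
# Sketch of a judge prompt per the recommendations above: small integer
# scale, explicit rubric, chain-of-thought field before the score.
# Template wording is an illustrative assumption.

JUDGE_TEMPLATE = """\
You will be given a user question and a model answer.
Rate the answer on a scale of 1 to 4:
1: Not helpful at all.
2: Partially addresses the question with major gaps.
3: Mostly correct and helpful, with minor issues.
4: Fully correct, complete, and directly helpful.

Question: {question}
Answer: {answer}

Respond in exactly this format:
Evaluation: <your step-by-step reasoning>
Total rating: <1-4>"""

def parse_rating(judge_output: str) -> int:
    """Extract the integer rating from the judge's response."""
    for line in judge_output.splitlines():
        if line.startswith("Total rating:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("No rating found in judge output")

prompt = JUDGE_TEMPLATE.format(
    question="What does HTTP 404 mean?",
    answer="The server could not find the requested resource.",
)
print(parse_rating("Evaluation: Accurate and concise.\nTotal rating: 4"))  # 4
```

To check for position bias, run the same comparison twice with the candidate outputs swapped and flag cases where the verdict flips.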
Production metrics sit at the top of the evaluation hierarchy, though they're messier than clean scores. Task completion rates, error patterns, user corrections, support tickets, churn. A model that scores 95% on your eval set but confuses 20% of your users is failing. The challenge is attribution: when a user churns, was it the AI? When they contact support, was it a model failure or a UX problem? Production metrics require careful instrumentation and often can't isolate the AI component cleanly.
Building Your First Eval Set
If you're building with AI and have no evaluation system, Anthropic recommends starting with 20-50 tasks derived from actual user failures. Not hypothetical edge cases. Real failures from your logs.
The reasoning: these are the cases that matter most. If your model fails on tasks where users already had problems, you're compounding frustration. Fix the known failures first.
For task complexity, Anthropic distinguishes three types: single-turn evals (one prompt, one response, simplest to build), multi-turn evals (conversation flows where context matters), and agent evals (multi-step workflows with tool use and branching logic).
Most teams should start with single-turn evals on their highest-impact failure modes, then expand from there.
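A single-turn eval case needs very little structure: the prompt that failed, what a correct response looks like, and a note on why the logged output was wrong. The fields and the example task below are hypothetical, shown only to make the shape concrete.

```python
# Hypothetical sketch of a single-turn eval set built from logged
# failures. Field names and the example case are illustrative.
from dataclasses import dataclass

@dataclass
class EvalCase:
    case_id: str        # pointer back to the production log entry
    prompt: str         # the input that failed in production
    expected: str       # reference answer or required behavior
    failure_note: str   # why the original output was wrong

# Start with 20-50 of these, pulled from real logs, not hypotheticals.
cases = [
    EvalCase(
        case_id="logs-2024-001",
        prompt="Summarize this refund policy in one sentence.",
        expected="Mentions the 30-day window and the receipt requirement.",
        failure_note="Model invented a 60-day window.",
    ),
]
print(len(cases))
```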
Whatever approach you take to building your eval set, you need a way to score outputs. Anthropic identifies three grader types: code-based graders for deterministic checks (did the output contain required fields? did it match the expected format?), model-based graders for nuanced, open-ended responses, and human graders for subjective quality calibration. The practical approach is layering: code-based checks for structural requirements, LLM-as-judge for semantic quality, and human evaluation for calibrating your automated graders.
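The layering can be sketched as a pipeline: run the cheap deterministic checks first and only spend a judge call on outputs that pass them. `judge_fn` is a placeholder for a real LLM call; the score threshold is an assumption.

```python
# Sketch of layered grading: code-based structural check first,
# model-based judge second. judge_fn stands in for a real LLM call.
import json

def code_grader(output: str, required_fields: list[str]) -> bool:
    """Deterministic check: valid JSON containing every required field?"""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(field in data for field in required_fields)

def grade(output: str, required_fields: list[str], judge_fn) -> dict:
    """Run cheap structural checks first; only call the judge if they pass."""
    if not code_grader(output, required_fields):
        return {"passed": False, "reason": "structural check failed"}
    score = judge_fn(output)  # e.g. a 1-4 rating from an LLM judge
    return {"passed": score >= 3, "reason": f"judge score {score}"}

def fake_judge(output: str) -> int:
    return 4  # stand-in for a real model call

result = grade('{"answer": "42", "sources": []}', ["answer", "sources"], fake_judge)
print(result)
```

Besides saving judge calls, this ordering keeps failure reasons legible: a structural failure is unambiguous, while a judge failure points you at semantic quality.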
Pass@k and Pass^k
Pass@k measures the probability of getting at least one success in k attempts. Useful for tasks where you can retry, like code generation with multiple samples.
Pass^k measures the probability of succeeding on all k attempts. This is reliability. A model with a 90% per-attempt success rate passes@1 at 90% but pass^3 at only about 73%, meaning it fails at least once in roughly 27% of three-attempt sequences.
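Under the simplifying assumption that attempts are independent with per-attempt success probability p (real attempts may be correlated), the two metrics reduce to one-liners:

```python
# Pass@k vs pass^k, assuming independent attempts with per-attempt
# success probability p. Independence is a simplifying assumption.

def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k attempts."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability of succeeding on all k attempts (pass^k)."""
    return p ** k

p = 0.9
print(f"pass@3 = {pass_at_k(p, 3):.3f}")   # 0.999
print(f"pass^3 = {pass_hat_k(p, 3):.3f}")  # 0.729
```

Note how the two metrics diverge as k grows: retries make pass@k approach 1 while pass^k decays toward 0, which is exactly why they answer different questions.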
For production systems, reliability often matters more than peak performance. A model that succeeds 95% of the time and fails gracefully might be preferable to one that succeeds 98% but fails catastrophically in the remaining 2%.
The infrastructure for evaluation has matured. Tools like DeepEval, Arize, and Braintrust provide structured frameworks for running evals at scale: dataset management, grader orchestration, result tracking, and regression detection.
Even with mature tooling, most teams skip systematic evaluation. The reasons are structural. Eval datasets are proprietary; as Eugene Yan notes, "nobody is open-sourcing their evaluators." Your evaluation set is competitive advantage. This means there's no shortcut; you have to build your own.
Shipping feels more urgent than measurement infrastructure. This is a false economy; unmeasured changes accumulate technical debt that compounds. And the learning curve is real; setting up LLM-as-judge with proper calibration takes time.
The counter-argument is simple: if you don't measure, you don't know. You might be shipping improvements. You might be shipping regressions. Without evaluation, you're gambling.
Our read: The tooling is no longer the bottleneck. Building good eval sets still requires domain expertise and effort, but the excuse that "eval infrastructure is too hard to set up" doesn't hold anymore. Start with 20-50 real failure cases. Set up an LLM-as-judge with proper prompt design. Calibrate against human labels. Track pass@k and pass^k. This isn't comprehensive, but it's enough to stop flying blind. The tools exist. The methodology is documented. The only thing missing is the decision to do it.