Write the eval before the prompt: eval-driven development for AI features
TDD's failing-test-first loop, applied to AI: write the eval before the prompt, let the score say when an LLM feature actually works, and stop shipping changes you can't measure.
By TristanThe reflex is almost universal. You wire up a model, write a prompt, run it on one or two inputs that came to mind, eyeball the output, and tweak the wording until the demo looks right. Ship it. It feels like progress — the screen fills with plausible text and the thing appears to work.
It quietly isn't. "Tweak the prompt until the demo works" has no definition of done; it has a definition of looked fine once. You never said what correct means, so you can't tell whether the next edit improved the feature or just moved the failure somewhere you didn't look. And because a language model's output drifts from run to run, the input you didn't try — or the one that passed yesterday — may already be broken. The regressions are real. You just never see them.
Builders solved this exact shape of problem a long time ago, in ordinary software, and the answer was test-driven development: write the failing check first, then make it pass. The same move works on AI features, and it has a name worth saying out loud — write the eval before the prompt.
What an eval actually is
An eval is a scored check on an AI feature's behavior: a set of inputs, the outputs your system produces for them, and a grader that says how good those outputs are. It's the AI-shaped cousin of a unit test, and the two differences are the whole point.
First, the output is non-deterministic. A unit test asserts that
add(2, 2) returns 4, and it returns 4 every time, forever. A prompt asked
the same question twice can answer two different ways — both fine, or one fine
and one not. So you stop thinking about a single assertion and start thinking
about a distribution of behavior across many cases. One green run proves
nothing; the shape of the scores over a set is the signal.
Second, the result is scored, not binary. "Correct" for a generated summary, a classification, or an agent's plan isn't always a clean true/false. Sometimes it is — valid JSON or not, the right label or the wrong one — and you should grab that certainty whenever the output allows it. But often quality lives on a gradient, and your grader has to produce a number you can compare, not just a pass stamp.
Hold those together and the unit is simple: input → output → grader. That triple is to an AI feature what a test case is to a function.
# one eval case, framework-neutral
input: "Refund the $40 order paid by card"
output: (whatever the model returns)
grade:
- parses as a RefundAction
- action is "refund"
- amount is 40.00Write that down and you've defined what the feature is supposed to do. Skip it — write the prompt first — and you've defined nothing, which is exactly why prompt-first feels productive and ages so badly.
Eval-driven development is TDD for AI
Kent Beck's loop is three beats — red, green, refactor — and it ports to AI features almost without translation.
Red: write the failing eval first. Before you touch the prompt, write the cases. Pin down a handful of representative inputs and what a good output looks like for each, then run them against whatever you have — an empty prompt, a stub, last week's version. It should fail, or score low. That failing score isn't a setback; it's the first time the feature has had an honest definition of done.
Green: build the minimal prompt or agent to pass it. Now write the smallest prompt, the simplest tool loop, the cheapest model that gets the score up. Not the cleverest prompt you can imagine — the one that passes. The eval, not your taste, decides when you're done. This is the step that kills bikeshedding: you stop arguing about prompt wording in the abstract, because the score settles it.
Refactor: iterate against the score. With a green bar you can finally change things safely. Swap the model for a cheaper one and watch the score. Trim the prompt and watch the score. Add a retrieval step and watch the score. Every change is now a measurement instead of a vibe, and the eval catches the regression the instant a "harmless" edit breaks a case you'd forgotten about.
The punchline is the same one that runs through the field guide to AI coding agents: a system is only as good as its ability to check its own work. An agent that can run your tests has a ground truth to move toward; a feature with an eval has the same. A good prompt is mostly a good definition of done — and the eval is that definition, written down and executable.
The grading ladder
The hard part of an eval is the grader, and graders come in rungs. Climb only as high as the output forces you to — cheaper, more deterministic grading is always better when you can get it.
| Rung | What it grades | Best for | The catch |
|---|---|---|---|
| Exact / structured assertions | parses, schema-valid, equals, contains, matches a rule | classification, extraction, tool calls — anything with a checkable shape | only works when "correct" is literally checkable; says nothing about quality |
| LLM-as-judge | fuzzy quality — faithfulness, helpfulness, tone, "did it answer" | summaries, open answers, anything on a gradient | the judge has biases, can be gamed, and needs its own checking |
| Regression set | did a known past failure come back | locking in every bug you've already fixed | only as good as the cases you remember to add |
Exact and structured assertions are the bottom rung and the one to reach for first. If the feature emits JSON, assert it parses and matches a schema. If it classifies, assert the label. If it calls a tool, assert the arguments. These checks are fast, free, deterministic, and impossible to argue with — the same certainty an ordinary unit test gives you. The limit is honesty: they tell you the shape is right, not whether the content is any good. A perfectly formatted, confidently wrong answer sails straight through.
LLM-as-judge is how you grade what a regex can't. You hand a second model the input, the output, and a rubric, and ask it to score — is this summary faithful to the source, did this answer actually resolve the question, is the tone right. It's powerful, and it's the rung to be most suspicious of. Judges have biases: they can favor longer answers, or whatever sounds confident, or outputs that resemble their own style. They can be gamed — optimize hard enough against a judge and you'll learn to please the judge instead of the user. Treat it as a component that itself needs evaluating: spot-check its scores against human judgment, keep its rubric concrete, and prefer pairwise comparisons ("is A better than B") over absolute scores, which drift. Never cite a judge's number as if it were ground truth.
The regression set is the rung that compounds. Every time a real failure slips through — a hallucinated field, a refusal that shouldn't have happened, an edge case nobody imagined — you turn it into a case and add it to the set. The suite stops being something you wrote once and becomes a growing memory of every mistake the feature has ever made, the way a good bug tracker turns incidents into regression tests. It's the cheapest rung to maintain and the one that quietly saves you the most, because the bug you already fixed is the one most likely to creep back after an innocent prompt edit.
Most real features use all three: structured assertions for the parts with a right answer, a judge for the parts without one, and a regression set growing underneath both.
Where it shines, and where it bites
Eval-driven development pays off most where behavior is scoped, repeatable, and has a checkable output: extraction, classification, routing, structured generation, retrieval answers you can check against a source, tool-calling agents whose actions you can assert on. Anything where you can say, concretely, what a good output contains is squarely in the strike zone — the same property that makes a task safe to hand a coding agent.
It bites where "good" is genuinely subjective and open-ended — a brainstorm, a poem, a long free-form draft where ten different outputs are all fine and the difference is taste. You can still eval the guardrails there: it stayed on topic, it didn't leak the system prompt, it held the format. But don't pretend a score captures the quality. Forcing a number onto something inherently subjective just buys you a confident metric that measures the wrong thing.
And sometimes an eval is overkill. For a throwaway script or a one-shot internal tool you'll run twice, a quick smoke check — does it run, does the output look sane — is the honest amount of rigor. The discipline is knowing the difference, not applying maximum ceremony to everything.
Gate every change, mine every failure
The reason to pay the upfront cost is that an eval suite is an asset that appreciates. The loop has two halves.
Build the eval set once, then make it the gate. Wire the suite into CI so no prompt edit, model swap, or dependency bump merges without running it. This is what turns "we think the new model is better" into "the score went up on the set we agreed matters." A CI-friendly harness — promptfoo is one open-source example — runs your cases on every change the same way your unit tests do.
Trace production and feed the failures back. Your eval set only knows the inputs you thought of; production knows the ones you didn't. Tracing tools such as Langfuse capture real prompts, outputs, and costs on live traffic, so when something breaks in the wild you can lift the offending case straight out of the trace and drop it into the regression set. Real failures become permanent tests. (For retrieval features, metrics-focused tools like Ragas score the RAG-specific things — faithfulness, answer relevance — that a generic grader misses.)
That's the flywheel, and it's the same one the evals and observability layer of the 2026 open-source AI stack is built around: trace everything, gate every change, and let production teach the eval set. Each turn makes the next change safer to ship. Skip it, and every prompt tweak is a fresh roll of the dice.
The cache
A few keepers, the way we keep everything here — examples, not endorsements.
- A version-controlled eval harness beats a spreadsheet. Whether it's promptfoo or something you grow yourself, the win is the same: cases in source control, run on every change, diffed like code. The tool matters less than the habit.
- Treat the judge as code that needs review. LLM-as-judge is a real technique — the model labs' own writing on evals and judging is worth reading for the patterns — but a judge you never check is just a second model you're trusting blindly. Calibrate it against human ratings before you lean on its scores.
- The best eval case is a real bug. The highest-value cases aren't the ones you brainstorm up front; they're the failures you catch in production and refuse to let happen twice. Mine your traces.
The names will drift — the harnesses, the tracing tools, the judge prompts will all look different in a year. The discipline won't. Writing the eval first is just TDD's oldest lesson in new clothes: you don't actually know what "working" means until you've written down how you'd check it, and you can't improve what you refuse to measure.
So before the next prompt, write the eval. Let the score, not the demo, decide when you're done — and stop shipping AI changes you can't measure.