
Decompose First, Judge Last

Field notes on evaluating LLM systems in production: what actually catches failures, what quietly doesn't, and why most teams reach for the wrong tool first.

TL;DR
  • The dividing line between AI teams that ship reliably and the rest is whether they can answer one question with a number: did your last prompt change make things better or worse?
  • Before you ask any model for an opinion, decompose the response into the largest set of small, deterministically verifiable checks and answer those with code. A judge tells you the answer feels worse; a deterministic check is a stack trace.
  • Reserve a calibrated, rubric-driven LLM judge for the subjective residue only, then wire the whole battery into CI so a quality regression blocks a release the way a broken test does.

There is one question that tells me more about an AI team's odds than any architecture diagram or model choice:

Did your last prompt change make things better or worse?

Teams that ship reliable AI answer with a number. Everyone else answers with a shrug, or worse, with confidence and no data.

I've spent enough time building and operating the evaluation layer for production LLM systems to be convinced that this single capability, measuring quality before you have a problem instead of after, is the dividing line between systems that survive contact with real users and systems that generate apology emails.

This article is the practical half of that conviction. Not a tour of evaluation theory, but the specific moves that worked, the failure modes that surprised me, and the one ordering decision that matters more than any tool choice: decompose first, judge last.

Why "It Feels Right" Stops Working

In 2023, nearly everyone evaluated AI output the same way. Change a prompt, run it a few times, eyeball the results, ship if nothing looks broken. The industry calls this the vibe check, and for a prototype it is genuinely fine.

The problem is that it breaks the moment you go to production, and it breaks in a way that is almost perfectly designed to hide from you.

LLM failures in production are usually semantic, not syntactic. Your API returns a 200. Your latency dashboard is green. What none of your traditional monitoring can see is that the model started hedging on answers it used to give cleanly, or restructured its JSON in a way that breaks a downstream parser, or quietly began interpreting "concise" as three paragraphs instead of three sentences.

Research on prompt brittleness has measured up to 76 accuracy points of variation from formatting changes alone, in few-shot settings on open models. The magnitude is the extreme case, but the direction of the finding holds everywhere: trivial-looking changes produce non-trivial behavioral shifts, and your infrastructure reports a healthy system the entire time.

This is why the consequences keep showing up as headlines rather than as alerts. CNET published finance articles riddled with AI-generated errors and had to issue corrections. Apple suspended its AI news summaries in early 2025 after they fabricated alerts. Air Canada was held liable for refund information its chatbot invented.

None of these were model-capability failures. They were evaluation failures: the system did something subtly wrong, and there was no automated layer positioned to catch "subtly wrong" before a human did.

A vibe check cannot scale past the point where one person can personally read every output. After that line, it is not a methodology. It is a hope.

The Stack, Briefly

A real evaluation framework has two layers, and they map onto practices you already trust.

Offline evaluation runs before you ship. It is unit and integration testing for model behavior: every prompt change, model upgrade, or tool-definition tweak runs against a curated dataset and gets scored before it touches production. This is where the "better or worse" question gets answered, and where a bad change dies quietly in CI instead of loudly in front of a customer.

Online evaluation runs in production. It is application performance monitoring, except that it watches semantic quality instead of CPU. A sample of live traffic is scored continuously against the same metrics, and when quality drifts below a threshold the monitor pages someone, the same way it would for a latency spike.

Offline catches the regressions you anticipated. Online catches the drift and the edge cases only real users produce. You need both, feeding each other: the most valuable additions to your offline dataset are the failures your online layer surfaces.

The Golden Dataset

Everything offline depends on one asset: a curated set of cases representing the real, diverse, and adversarial requests your system actually receives. Building a good one is the highest-leverage work in this discipline, and three principles separate the useful ones from the shelf-ware.

1. Start at fifty to a couple hundred cases, not thousands. The goal is coverage of your real distribution, not volume. Two hundred cases that include your genuine edge cases protect you better than ten thousand bland ones, and a small set is one your team will actually maintain.

2. Source from production traffic, not imagination. Cases invented in a conference room reflect how your team thinks the product is used. Cases in your logs reflect how it is actually used, and the gap between those two is exactly where you get hurt. Every confusing query and near-miss your online layer flags is a candidate for the golden set.

3. Treat it as a living asset with a named owner. It grows every time production teaches you something, and it is versioned alongside the prompts it validates, so a year from now you can answer not just "does this pass" but "which version of our quality bar did this pass against."
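One way to make "which version of our quality bar" answerable is to fingerprint the dataset and record that fingerprint with every eval run. The sketch below is illustrative only: `GoldenCase`, its fields, and `dataset_fingerprint` are hypothetical names, not part of any particular tool.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    user_input: str
    source: str        # illustrative labels such as "production-log"
    tags: tuple = ()   # e.g. ("billing", "adversarial")


def dataset_fingerprint(cases):
    """Return a short hash of the dataset contents, independent of case order."""
    canonical = sorted(json.dumps(asdict(c), sort_keys=True) for c in cases)
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
    return digest[:12]
```

Storing the fingerprint next to the prompt version in each run record means any past result can be tied back to the exact set of cases it was scored against.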

An unowned golden dataset goes stale in a quarter. And a stale dataset is worse than none, because it keeps emitting green checkmarks while measuring a product that no longer exists.

Decompose First, Judge Last

Here is where most teams go wrong, and it's an ordering mistake, not a tooling one.

The obvious objection to scoring two hundred cases on every change is that someone has to score them, and humans don't scale to continuous evaluation. The fashionable answer is LLM-as-judge: use a model to grade the model. Judges have a real place, and I'll get to it.

But if your evaluation strategy starts with a judge, you've reached for the noisiest, most expensive, least reproducible tool in the box before trying the sharp ones.

The move that matters most, and the one most teams skip: before you ask any model for an opinion, decompose the response into the largest possible set of small, deterministically verifiable questions, and answer those with code.

A holistic "rate this answer 1 to 5" is a blunt instrument. It compresses everything you care about into one fuzzy number, and when that number drops you have no idea why. Do the opposite. Break the same response into a battery of narrow checks, each with an unambiguous answer a machine can compute:

  • Is the output valid JSON against the expected schema? (parse it)
  • Are all required fields present and non-empty? (check them)
  • Is every monetary figure formatted to two decimals? (regex)
  • Does the cited figure match the value in the retrieved context? (compare against the source)
  • Is the response under the token budget your cost model assumes? (count)
  • Did it avoid language that signals an out-of-scope claim, like an unsolicited diagnosis? (a deny-list or a tightly scoped binary classifier)
  • Does every claim that should be grounded actually appear in the provided documents? (decompose into atomic claims, check each)

Each of these is a pass or fail you can verify without a model's opinion. Run them and you don't get a vague sense that quality slipped. You get exactly which property broke on which case.

That is a categorical difference. A judge tells you the answer feels worse. A deterministic check tells you schema validation failed on case 47 because the model started emitting a trailing field.

One sends you spelunking. The other is a stack trace.
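To show what a battery like this can look like, here is a minimal sketch covering a few of the checks above. The schema (`summary`, `amount`), the word-count stand-in for a token budget, and the function names are all assumptions made for illustration; a real harness would use your own schema and tokenizer.

```python
import json
import re

REQUIRED_FIELDS = ("summary", "amount")  # hypothetical schema, not from the article
MONEY = re.compile(r"\d+\.\d{2}")        # digits, a dot, exactly two decimals
MAX_WORDS = 150                          # crude word-count stand-in for a token budget


def run_checks(raw_output):
    """Return {check_name: passed} for one raw model response."""
    results = {
        "valid_json": False,
        "required_fields": False,
        "money_format": False,
        "no_extra_fields": False,
        "within_budget": len(raw_output.split()) <= MAX_WORDS,
    }
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return results  # nothing else can be checked without a parse
    if not isinstance(data, dict):
        return results
    results["valid_json"] = True
    results["required_fields"] = all(
        data.get(field) not in (None, "") for field in REQUIRED_FIELDS
    )
    results["money_format"] = bool(MONEY.fullmatch(str(data.get("amount", ""))))
    results["no_extra_fields"] = set(data) <= set(REQUIRED_FIELDS)
    return results


def failures(case_id, raw_output):
    """List human-readable failures, one per broken property."""
    return [
        f"case {case_id}: {name} failed"
        for name, passed in run_checks(raw_output).items()
        if not passed
    ]
```

Calling `failures(47, '{"summary": "Refund issued", "amount": "12.5", "debug": true}')` reports that `money_format` and `no_extra_fields` failed on case 47, which is the stack-trace quality a holistic score cannot give you.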
◆ Background · decompose-then-verify in the research

The atomic-claim move has a lineage

The instinct to break one answer into many small checks is not just operational folklore; it is the design behind FActScore (Min et al., EMNLP 2023). The method decomposes a long-form generation into atomic facts, short statements each carrying a single piece of information, then scores the fraction supported by a reliable source. The paper's own motivation is precisely the article's point: "generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate." Later work like VeriScore extends the same decompose-then-verify pipeline. The lesson that carries over: the granularity of your checks is the granularity of your diagnostics.
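The scoring step of that pipeline reduces to a fraction, which a short sketch makes concrete. This is a toy under loud assumptions: the claims are taken as already extracted, and `naive_support` is a verbatim substring match standing in for a real verifier. It is not the FActScore implementation, which relies on model-based decomposition and verification against a knowledge source.

```python
def naive_support(claim, documents):
    """Toy verifier: a claim is supported if it appears verbatim in any
    document, ignoring case and whitespace differences."""
    needle = " ".join(claim.lower().split())
    return any(needle in " ".join(doc.lower().split()) for doc in documents)


def grounding_report(claims, documents, verifier=naive_support):
    """Return (fraction of claims supported, list of unsupported claims)."""
    verdicts = {claim: verifier(claim, documents) for claim in claims}
    # With no claims to check there is nothing unsupported, so report 1.0.
    score = sum(verdicts.values()) / len(verdicts) if verdicts else 1.0
    unsupported = [claim for claim, ok in verdicts.items() if not ok]
    return score, unsupported
```

The list of unsupported claims matters more than the score: it names the exact statement that has no source, rather than telling you the answer is "60% grounded."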

In an earlier piece on prompts as technical debt, I described a healthcare migration where three things broke: the model offering unsolicited diagnoses, producing five times the tokens, and shattering the JSON parser. Notice that none of those required a sophisticated judge to catch. A length assertion catches the token explosion. A schema validator catches the JSON breakage. A scoped keyword guard flags the diagnostic language.

The most expensive failures in that story were deterministically detectable. The team found them in production only because nobody had decomposed "is this output acceptable" into checks that would have failed loudly in CI.

The deterministic layer is also what makes online monitoring tractable. Judging every live request with an LLM is slow and costly, and it gives you a trend line you still have to interpret. Running deterministic checks on live traffic is cheap, fast, and self-explaining: when the "JSON valid" rate drops from 99.9% to 96%, you know the shape of the problem before you open a log.
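A rolling pass-rate monitor for one deterministic check can be very small. The sketch below is illustrative: the class name, window size, threshold, and minimum sample count are assumptions you would tune, and the paging integration is left out entirely.

```python
from collections import deque


class PassRateMonitor:
    """Track the pass rate of one check over the most recent requests."""

    def __init__(self, window=1000, threshold=0.98, min_samples=100):
        self.results = deque(maxlen=window)  # old results fall off automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, passed):
        self.results.append(bool(passed))

    def pass_rate(self):
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def should_alert(self):
        # Stay quiet until there is enough traffic to trust the rate.
        if len(self.results) < self.min_samples:
            return False
        return self.pass_rate() < self.threshold
```

Keeping one monitor per check, rather than one for overall quality, is what preserves the self-explaining property: the alert that fires is already named after the property that broke.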

The principle: spend your determinism budget aggressively. Anything you can verify with code, verify with code. Escalate to a judge only for the residue that genuinely requires judgment.

Then, for the Residue, the Judge

Some qualities resist deterministic checking. Tone. Helpfulness. Whether an explanation is actually clear. This is the irreducible subjective remainder, and here a model judge earns its cost, provided you run it with discipline.

Three rules, each learned the annoying way.

A rubric, not a vibe. "Score 1 to 5 on factual accuracy, where 5 means every claim is supported by the provided context and 1 means it contains a fabrication that could mislead a user" is a rubric. "Rate the quality" is not. The first version of any rubric will fail in the same way: it encodes what you assumed mattered, not what production shows matters. Expect to rewrite it after you compare judge scores to real outcomes.

Reasoning before score. Force the judge to write its reasoning before it emits a number. A score that follows written reasoning tends to be more reliable, and the reasoning makes the score auditable: when a number surprises you, the reasoning is what tells you whether the judge or the rubric is wrong.
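The first two rules can be enforced mechanically in the prompt and the parser. The sketch below is a minimal illustration: the `REASONING:` / `SCORE:` output format and the function names are my own assumptions, and the call to the judge model itself is deliberately omitted.

```python
import re

RUBRIC = (
    "Score 1 to 5 on factual accuracy, where 5 means every claim is supported "
    "by the provided context and 1 means it contains a fabrication that could "
    "mislead a user."
)

_SCORE_LINE = re.compile(r"^SCORE:\s*([1-5])\s*$", re.MULTILINE)


def build_judge_prompt(context, answer):
    """Assemble a prompt that demands reasoning first, then a score line."""
    return (
        f"{RUBRIC}\n\n"
        f"Context:\n{context}\n\n"
        f"Answer to grade:\n{answer}\n\n"
        "First write 'REASONING:' followed by your reasoning.\n"
        "Then, on a new final line, write 'SCORE:' followed by a single digit."
    )


def parse_judge_reply(reply):
    """Return (reasoning, score); raise ValueError if the format is violated."""
    match = _SCORE_LINE.search(reply)
    if match is None:
        raise ValueError("no valid SCORE line found")
    reasoning = reply[: match.start()].strip()
    if reasoning.upper().startswith("REASONING:"):
        reasoning = reasoning[len("REASONING:"):].strip()
    if not reasoning:
        raise ValueError("judge emitted a score without reasoning before it")
    return reasoning, int(match.group(1))
```

Rejecting replies that skip the reasoning, rather than silently accepting the number, keeps every stored score paired with text a human can audit later.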

Calibrate against humans, then re-calibrate. Validate the judge against human ratings on a sample, so you know how well your automated grader agrees with the people it's replacing. Then check again after every major model or rubric change. Judge-human agreement is not a property you establish once. It decays.
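One common way to put a number on judge-human agreement is Cohen's kappa, which corrects raw agreement for the agreement you would expect by chance. It is one choice among several, not something the workflow above prescribes. A minimal pure-Python sketch, assuming both raters labeled the same cases with categorical labels such as pass/fail or 1 to 5:

```python
from collections import Counter


def cohen_kappa(human_labels, judge_labels):
    """Chance-corrected agreement between two raters on the same cases."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    # Probability that both raters pick the same label if they were independent.
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Tracking this value per rubric version, and recomputing it after each model or rubric change, turns "it decays" from a suspicion into a trend you can see.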

One failure mode deserves naming: judges are biased toward their own family's style. A GPT-based judge prefers GPT-flavored writing; a Claude-based judge has its own aesthetics. The mitigation is a jury: judges from more than one model family, with disagreement surfaced as a signal worth a human's attention. It costs more. It also stops you from quietly optimizing your product to please one model's taste instead of your users'.
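The aggregation side of a jury is simple to sketch. Below, the judge names, the use of the median, and the `max_spread` cutoff are all illustrative assumptions; the point is only that disagreement is computed and surfaced rather than averaged away.

```python
from statistics import median


def jury_verdict(scores_by_judge, max_spread=1):
    """Combine 1-5 scores from several judges and flag disagreement.

    scores_by_judge maps a judge name (hypothetical, e.g. "family_a")
    to the score that judge gave the same response.
    """
    if not scores_by_judge:
        raise ValueError("need at least one judge score")
    values = list(scores_by_judge.values())
    spread = max(values) - min(values)
    return {
        "score": median(values),
        "spread": spread,
        "needs_human": spread > max_spread,  # route to a reviewer, do not hide it
    }
```

Cases flagged with `needs_human` are also good candidates for the golden dataset, since they mark the places where the rubric itself is ambiguous.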

◆ Background · self-preference bias is measured, not folklore

Why a single-judge setup quietly skews

The "judges prefer their own family" warning is backed by peer-reviewed work. Self-Preference Bias in LLM-as-a-Judge (NeurIPS 2024) found models such as GPT-4 systematically award higher scores to their own outputs. The proposed mechanism is revealing: LLMs over-reward text with low perplexity, meaning text that reads like what they would most likely generate themselves. Related work reports that the strength of the bias tracks a model's ability to recognize its own output. The practical takeaway lines up with the article: a single judge does not give you a neutral grade; it gives you one model's taste, which is exactly why a multi-family jury is the mitigation rather than a luxury.

Once the battery exists, the whole thing snaps into CI. A prompt change opens a pull request, the harness runs the golden dataset, and the build fails if a quality metric regresses past tolerance. A quality regression becomes exactly as routine, and exactly as blocking, as a failed unit test.
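The gate itself can be a few lines that compare the current run against a stored baseline. This is a sketch under assumptions: metrics are pass rates between 0 and 1 where higher is better, and the function names and the 0.01 tolerance are placeholders, not a recommendation.

```python
def find_regressions(baseline, current, tolerance=0.01):
    """Return a description of every metric that dropped past tolerance."""
    regressions = []
    for metric, base_value in sorted(baseline.items()):
        new_value = current.get(metric)
        if new_value is None:
            # A metric that silently disappears should fail the build too.
            regressions.append(f"{metric}: missing from current run")
        elif new_value < base_value - tolerance:
            regressions.append(f"{metric}: {base_value:.3f} -> {new_value:.3f}")
    return regressions


def gate_exit_code(baseline, current, tolerance=0.01):
    """Print regressions and return a process exit code for the CI job."""
    problems = find_regressions(baseline, current, tolerance)
    for line in problems:
        print("REGRESSION", line)
    return 1 if problems else 0
```

In a CI job, passing the result of `gate_exit_code(...)` to `sys.exit` is enough to fail the build, since any nonzero exit code marks the step as failed in the usual CI systems.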

That is the entire game: making semantic quality behave like a test you can fail.

The tooling is no longer exotic. DeepEval, Promptfoo, LangSmith and others cover both offline and online evaluation. The hard part is the dataset, the rubrics, and the organizational will to let a quality gate block a release.

What Goes Wrong Even When You Do This

Field notes would be dishonest without the failure modes that show up after you build the harness.

Thresholds tuned for purity page you into numbness. Set your online alerts too tight and the on-call engineer learns within two weeks that "quality alert" means "noise," which is how a real regression eventually sails through a channel everyone has muted. Start loose, tighten deliberately, and treat every alert that didn't lead to action as a bug in the alert.

The gate gets bypassed exactly when it matters most. The first time a deadline collides with a failing eval, someone will propose shipping anyway and "fixing it after." If the answer is yes once, the harness is decorative. The organizational decision to let evaluation block a release is harder than any engineering in this article, and it is the actual difference between teams that have evaluation infrastructure and teams that have evaluation theater.

Passing evals get mistaken for a quality ceiling. The harness tells you that you haven't regressed against the cases you know about. It says nothing about the cases you don't. Teams that treat a green eval run as "the system is good" stop mining production for new failure cases, and the golden dataset quietly stops representing reality. The harness is a floor, never a ceiling.

The maturity trap is the trigger, not the tests. Teams consistently overestimate their evaluation maturity. The honest test: if a human has to remember to run your evaluation, you have batch testing, not evaluation infrastructure, no matter how good your dataset is. The jump that matters is removing the human from the trigger, so the "better or worse" question gets answered on every change whether anyone remembered to ask it or not.

The Industry Backdrop

You've likely seen MIT NANDA's much-quoted finding that 95% of enterprise GenAI pilots deliver no measurable P&L impact. The number deserves its caveats: the methodology has been criticized as preliminary, and the "failure" in the headlines means no measurable financial return within a short window, not that the projects collapsed.

But the report's diagnosis is harder to argue with than its headline. The researchers attributed the divide not to model quality but to a learning gap: organizations unable to integrate AI into workflows in a way that compounds.

I'd push that one step further. A learning gap is, mechanically, a measurement gap. You cannot integrate what you cannot observe, and you cannot iterate toward reliability if every change is evaluated by a developer squinting at five outputs. The compounding the successful minority achieves has a concrete technical form, and it's the harness described above.

The pressure is no longer only internal. Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027, driven by governance and ROI failures, and it projects that AI regulation will expand to cover 75% of the world's economies by 2030. Under the EU AI Act's high-risk provisions, an auditor will be able to ask which model version was live when a decision was made and what its quality metrics were at deployment.

If your answer is a shrug, the shrug is now a liability.

Before You Scale, Build the Harness

If you take one thing from this: build the evaluation layer before you scale the feature, not after.

Curate one hundred to two hundred real cases from your logs. Decompose "is this answer good" into the largest set of small, deterministic checks you can write, and verify those with code. Reserve a calibrated, rubric-driven judge for the subjective residue. Wire the battery into CI so a regression blocks a release the way a broken test does. Run the deterministic checks on a sample of production traffic so drift pages you with its shape already known.

It is unglamorous work. It is also the single highest-leverage investment a team moving from prototype to production can make, because it converts every future change from a gamble into a measurement.

The prompt was the beginning. The harness is what keeps it honest.