How I Evaluate AI Systems

A model's headline benchmark is not its value. What matters is whether its output survives contact with the rest of the system — structure, cost, and the cleanup it forces downstream.

The benchmark is not the system

When I had to choose an extraction provider for a production document pipeline, the headline accuracy number turned out to be the least useful figure I had. A model can score well on a public benchmark and still be the wrong choice, because a benchmark measures the model in isolation and a production system never runs it in isolation. What matters is whether the output survives contact with everything downstream: the schema it has to fit, the budget it runs against, and the manual cleanup a wrong answer forces a human to do later.

So I do not evaluate a model. I evaluate the model in the seam where it sits.

What I actually compare

Three things, in this order.

Structure. Does the output fit a strict schema every time? In the pipeline, every extraction has to land in a typed Pydantic schema. A provider that returns fluent, correct-looking prose but will not reliably produce the schema has failed the only test that matters for an automated system — the next step cannot consume it.
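To make "fits a strict schema" concrete, here is a minimal sketch. The pipeline described here uses Pydantic; this version uses only the standard library so it runs without dependencies. The `InvoiceFields` name, its three fields, and the `parse_invoice` helper are hypothetical illustrations, not the pipeline's actual schema.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation


class SchemaError(ValueError):
    """Raised when provider output does not fit the schema."""


@dataclass(frozen=True)
class InvoiceFields:
    # Hypothetical fields, chosen for illustration only.
    vendor: str
    invoice_date: date
    total: Decimal


def parse_invoice(raw) -> InvoiceFields:
    """Coerce a provider's raw output into typed fields, or raise SchemaError."""
    try:
        vendor = raw["vendor"]
        invoice_date = date.fromisoformat(raw["invoice_date"])
        total = Decimal(str(raw["total"]))
    except KeyError as exc:
        raise SchemaError(f"missing field: {exc.args[0]}") from exc
    except (TypeError, ValueError, InvalidOperation) as exc:
        # Covers prose instead of a dict, non-ISO dates, non-numeric totals.
        raise SchemaError(f"bad value: {exc}") from exc
    if not isinstance(vendor, str) or not vendor.strip():
        raise SchemaError("vendor must be a non-empty string")
    if not total.is_finite():
        raise SchemaError("total must be a finite number")
    return InvoiceFields(vendor.strip(), invoice_date, total)
```

A provider that answers with a fluent sentence such as "The total is 119.00" raises `SchemaError` here, which is the point: the next step either receives typed fields or receives nothing.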

Accuracy per cleanup, not headline accuracy. The real cost of a wrong extraction is the human time it takes to find and fix it downstream. A cheaper provider that needs an afternoon of data-cleaning is not cheaper. When I compared a dedicated OCR service, a general-purpose LLM (gpt-4o), and a document-native model, the decision came down to this: the OCR service returned raw labelled boxes that fit no schema without a custom normalization layer; the general LLM produced clean JSON but fumbled low-quality scans; the document-native model was accurate enough to need no formatting cleanup at all. Price per page was the tie-breaker, not the deciding factor.
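The "cheaper is not cheaper" argument can be put as simple arithmetic. The sketch below is an assumption about how one might model it, and every number in it is made up for illustration; none of them are measurements from the pipeline.

```python
def effective_cost_per_page(price_per_page: float,
                            correction_rate: float,
                            minutes_per_correction: float,
                            hourly_rate: float) -> float:
    """Provider price plus the expected human cleanup cost for one page."""
    cleanup = correction_rate * (minutes_per_correction / 60.0) * hourly_rate
    return price_per_page + cleanup


# Illustrative numbers only: a cheap provider that needs a fix on 10% of pages
# versus a pricier one that needs a fix on 1% of pages.
cheap = effective_cost_per_page(0.002, correction_rate=0.10,
                                minutes_per_correction=3, hourly_rate=40)
pricey = effective_cost_per_page(0.010, correction_rate=0.01,
                                 minutes_per_correction=3, hourly_rate=40)

print(f"cheap provider:  {cheap:.3f} per page")
print(f"pricey provider: {pricey:.3f} per page")
```

Under these assumed inputs the cleanup term dominates the price term, so the provider with the lower sticker price ends up costing more per page.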

Failure behaviour. Does it fail loudly or quietly? A model that throws an error is safe — the system stops and falls back. A model that returns a confident, well-formed, wrong answer is the dangerous one, because nothing downstream knows to distrust it.

Validate before you trust

That last point drives a concrete design rule: fall back on validation failure, not just on API errors. Every result is validated against its schema before it is accepted, and a validation failure falls through to another provider, the same path an outright error would take. The cost is a second call's latency; what it buys is that an extraction which looks plausible but does not fit the schema cannot silently enter the books.
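A minimal sketch of that rule, assuming each provider is a plain callable and the validator raises on bad output. The function names, the stub providers, and the `AllProvidersFailed` exception are hypothetical; a real system would likely catch narrower exception types than `Exception`.

```python
from typing import Any, Callable, Iterable, Tuple

Provider = Callable[[bytes], Any]   # hypothetical: document bytes -> raw output
Validator = Callable[[Any], Any]    # returns a typed result or raises


class AllProvidersFailed(RuntimeError):
    """Every provider either errored or returned output that failed validation."""


def extract_with_fallback(document: bytes,
                          providers: Iterable[Tuple[str, Provider]],
                          validate: Validator) -> Tuple[str, Any]:
    failures = []
    for name, call in providers:
        try:
            raw = call(document)
            # Validation sits inside the same try block as the call, so a
            # schema failure takes the same fallback path as an API error.
            return name, validate(raw)
        except Exception as exc:
            failures.append(f"{name}: {exc!r}")
    raise AllProvidersFailed("; ".join(failures))


# Stub providers for illustration only.
def timing_out(_doc: bytes) -> Any:
    raise TimeoutError("provider timed out")


def confident_but_malformed(_doc: bytes) -> Any:
    return {"total": "about a hundred"}


def well_formed(_doc: bytes) -> Any:
    return {"total": "119.00"}


def validate_total(raw: Any) -> dict:
    return {"total": float(raw["total"])}  # raises ValueError on non-numeric text


winner, result = extract_with_fallback(
    b"...",
    [("first", timing_out), ("second", confident_but_malformed), ("third", well_formed)],
    validate_total,
)
```

Here the first provider errors, the second returns a well-formed dict whose value fails validation, and the third is accepted. Note what this does not do: a value that passes validation but is simply wrong still goes through.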

A confident answer that fails the schema is more dangerous than an error. An error stops; a plausible-but-wrong value flows downstream and surfaces weeks later.

The honest gap: I have not measured it yet

Here is what I will not pretend: in that pipeline, extraction accuracy appears high in practice but has never been measured. There is no evaluation layer yet, and the validation rule above catches malformed output, not plausible-but-wrong output. That is the real limitation, and naming it is part of the evaluation.

If I were to build the eval, and for a client system I would, it would have four pieces:

A golden set. A few hundred documents with hand-verified field values, deliberately weighted toward the hard cases (poor scans, unusual layouts), not just the easy ones.
Field-level scoring. Exact match for structured fields — dates, totals, tax — and fuzzy match for names; precision and recall reported per field, never collapsed into one global percentage that hides where it actually fails.
A per-provider, per-tier breakdown, so that swapping a provider becomes a measurable regression test rather than a judgement call.
A cleanup log. Every manual correction recorded, because the correction rate per field is the accuracy-per-cleanup metric — it closes the loop back to the only cost that matters.
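The field-level scoring piece could be sketched roughly as follows. This is one possible reading, not an existing implementation: the field names, the 0.9 fuzzy threshold, and the convention that a wrong value counts as both a false positive and a false negative are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher

FUZZY_FIELDS = {"vendor"}   # hypothetical: name-like fields get fuzzy matching
FUZZY_THRESHOLD = 0.9       # assumed cut-off; would need tuning on real data


def matches(field: str, expected: str, actual: str) -> bool:
    if field in FUZZY_FIELDS:
        ratio = SequenceMatcher(None, expected.casefold(), actual.casefold()).ratio()
        return ratio >= FUZZY_THRESHOLD
    return expected == actual  # exact match for dates, totals, tax


def score_fields(golden: dict, predicted: dict) -> dict:
    """Per-field precision and recall over a golden set keyed by document id."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for doc_id, truth in golden.items():
        guess = predicted.get(doc_id, {})
        for field in truth.keys() | guess.keys():
            expected, actual = truth.get(field), guess.get(field)
            c = counts[field]
            if expected is not None and actual is not None and matches(field, expected, actual):
                c["tp"] += 1
                continue
            if actual is not None:
                c["fp"] += 1   # extracted something that is wrong or spurious
            if expected is not None:
                c["fn"] += 1   # missed or got wrong a value that exists
    report = {}
    for field, c in counts.items():
        p_den, r_den = c["tp"] + c["fp"], c["tp"] + c["fn"]
        report[field] = {
            "precision": c["tp"] / p_den if p_den else None,
            "recall": c["tp"] / r_den if r_den else None,
        }
    return report


# Tiny made-up golden set, for illustration only.
golden = {
    "d1": {"vendor": "Acme GmbH", "total": "119.00", "date": "2024-03-01"},
    "d2": {"vendor": "Globex Ltd", "total": "50.00", "date": "2024-03-02"},
}
predicted = {
    "d1": {"vendor": "ACME GmbH", "total": "119.00", "date": "2024-03-01"},
    "d2": {"vendor": "Globex Ltd", "total": "56.00", "date": None},
}
report = score_fields(golden, predicted)
```

Running `score_fields` once per provider over the same golden set gives the per-provider breakdown, and keeping the report keyed by field avoids the single global percentage that hides where extraction fails.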

Evaluation is the part most AI demos skip. It is also the part that decides whether a system can be trusted to run unattended. The framework above is what I bring to that question.
