How I Evaluate AI Systems
A model's headline benchmark is not its value. What matters is whether its output survives contact with the rest of the system — structure, cost, and the cleanup it forces downstream.
The benchmark is not the system
When I had to choose an extraction provider for a production document pipeline, the headline accuracy number turned out to be the least useful figure I had. A model can score well on a public benchmark and still be the wrong choice, because a benchmark measures the model in isolation and a production system never runs it in isolation. What matters is whether the output survives contact with everything downstream: the schema it has to fit, the budget it runs against, and the manual cleanup a wrong answer forces a human to do later.
So I do not evaluate a model. I evaluate the model in the seam where it sits.
What I actually compare
Three things, in this order.
Structure. Does the output fit a strict schema every time? In the pipeline, every extraction has to land in a typed Pydantic schema. A provider that returns fluent, correct-looking prose but will not reliably produce the schema has failed the only test that matters for an automated system — the next step cannot consume it.
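A minimal sketch of what "land in a typed schema or fail" can look like. The pipeline uses Pydantic; this sketch uses standard-library dataclasses as a stand-in so it runs without dependencies. The field names (`vendor`, `invoice_date`, `total`) and the `SchemaError` type are hypothetical, not the pipeline's real schema.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation


class SchemaError(ValueError):
    """Raised when provider output does not fit the schema."""


@dataclass(frozen=True)
class InvoiceExtraction:
    # Hypothetical fields, chosen only for illustration.
    vendor: str
    invoice_date: date
    total: Decimal

    @classmethod
    def parse(cls, raw: dict) -> "InvoiceExtraction":
        """Return a fully typed record or raise SchemaError; never a partial one."""
        vendor = raw.get("vendor")
        if not isinstance(vendor, str) or not vendor.strip():
            raise SchemaError("vendor must be a non-empty string")
        try:
            invoice_date = date.fromisoformat(raw["invoice_date"])
            total = Decimal(str(raw["total"]))
        except (KeyError, TypeError, ValueError, InvalidOperation) as exc:
            raise SchemaError(f"output does not fit schema: {exc!r}") from exc
        if not total.is_finite():
            raise SchemaError("total must be a finite number")
        return cls(vendor=vendor.strip(), invoice_date=invoice_date, total=total)
```

The point of the design is that there is no third outcome: the next step either receives a typed record or sees an exception it can act on.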
Accuracy per cleanup, not headline accuracy. The real cost of a wrong extraction is the human time it takes to find and fix it downstream. A cheaper provider that needs an afternoon of data-cleaning is not cheaper. When I compared a dedicated OCR service, a general-purpose LLM (gpt-4o), and a document-native model, the decision came down to this: the OCR service returned raw labelled boxes that fit no schema without a custom normalization layer; the general LLM produced clean JSON but fumbled low-quality scans; the document-native model was accurate enough to need no formatting cleanup at all. Price per page was the tie-breaker, not the deciding factor.
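The "not cheaper" argument is simple arithmetic, sketched here under assumptions: the function name, its parameters, and every number in the usage lines are hypothetical and illustrate the shape of the comparison, not measurements from the pipeline.

```python
def effective_cost_per_page(
    price_per_page: float,
    correction_rate: float,
    minutes_per_correction: float,
    hourly_rate: float,
) -> float:
    """Provider price plus the expected human cleanup cost for one page."""
    if not 0.0 <= correction_rate <= 1.0:
        raise ValueError("correction_rate must be between 0 and 1")
    cleanup_cost = correction_rate * (minutes_per_correction / 60.0) * hourly_rate
    return price_per_page + cleanup_cost


# Hypothetical inputs: a cheap provider whose pages often need fixing,
# and a pricier one whose pages rarely do.
cheap = effective_cost_per_page(0.01, correction_rate=0.10,
                                minutes_per_correction=3, hourly_rate=40)
pricier = effective_cost_per_page(0.05, correction_rate=0.01,
                                  minutes_per_correction=3, hourly_rate=40)
print(f"cheap: {cheap:.3f} per page, pricier: {pricier:.3f} per page")
```

With inputs like these the cleanup term dominates the price term, which is why price per page only works as a tie-breaker.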
Failure behaviour. Does it fail loudly or quietly? A model that throws an error is safe — the system stops and falls back. A model that returns a confident, well-formed, wrong answer is the dangerous one, because nothing downstream knows to distrust it.
Validate before you trust
That last point drives a concrete design rule: fall back on validation failure, not just on API errors. Every result is validated against its schema before it is accepted, and a validation failure falls through to the next provider, the same path an outright error would take. The cost is a second call's latency; what it buys is that a confident answer that violates the schema does not silently enter the books.
A confident answer that fails the schema is more dangerous than an error. An error stops; a plausible-but-wrong value flows downstream and surfaces weeks later.
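The rule can be sketched in a few lines. This is an illustration under assumptions, not the pipeline's code: `extract_with_fallback`, the `(name, callable)` provider pairs, and the `validate` callback are all hypothetical names.

```python
from typing import Callable, Iterable

Provider = Callable[[bytes], dict]   # hypothetical: raw document in, raw fields out
Validator = Callable[[dict], dict]   # hypothetical: raises if the fields do not fit


def extract_with_fallback(
    document: bytes,
    providers: Iterable[tuple[str, Provider]],
    validate: Validator,
) -> tuple[str, dict]:
    """Return (provider_name, validated_result) from the first provider that passes."""
    failures = []
    for name, call in providers:
        try:
            raw = call(document)
            # Validation runs inside the same try block on purpose: a schema
            # failure takes exactly the same fallback path as an API error.
            return name, validate(raw)
        except Exception as exc:
            failures.append(f"{name}: {exc!r}")
    # Nothing passed validation, so fail loudly instead of returning a guess.
    raise RuntimeError("all providers failed: " + "; ".join(failures))
```

Keeping the call and the validation in one `try` is the whole design choice: the caller cannot tell, and does not need to tell, whether a provider was skipped for erroring or for returning something that did not fit.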
The honest gap: I have not measured it yet
Here is what I will not pretend: in that pipeline, extraction accuracy looks high in practice, but it is unmeasured. There is no evaluation layer yet: the validation rule above catches malformed output, not output that is well-formed and wrong. That is the real limitation, and naming it is part of the evaluation.
If I were to build the eval, and for a client system I would, it would have four pieces:
- A golden set. A few hundred documents with hand-verified field values, deliberately weighted toward the hard cases (poor scans, unusual layouts), not just the easy ones.
- Field-level scoring. Exact match for structured fields (dates, totals, tax) and fuzzy match for names; precision and recall reported per field, never collapsed into one global percentage that hides where the model actually fails.
- A per-provider, per-tier breakdown, so that swapping a provider becomes a measurable regression test rather than a judgement call.
- A cleanup log. Every manual correction recorded, because the correction rate per field is the accuracy-per-cleanup metric — it closes the loop back to the only cost that matters.
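Since this eval does not exist yet, the following is only a sketch of how the field-level scoring piece might look. The fuzzy field set, the 0.9 similarity threshold, and the convention that a wrong value counts as both a false positive and a false negative are my assumptions, not settled choices.

```python
from collections import defaultdict
from difflib import SequenceMatcher

FUZZY_FIELDS = {"vendor"}   # hypothetical: name-like fields compared fuzzily
FUZZY_THRESHOLD = 0.9       # assumed similarity cut-off


def field_matches(field: str, predicted, expected) -> bool:
    """Fuzzy comparison for name-like fields, exact comparison for everything else."""
    if field in FUZZY_FIELDS:
        ratio = SequenceMatcher(
            None, str(predicted).lower(), str(expected).lower()
        ).ratio()
        return ratio >= FUZZY_THRESHOLD
    return predicted == expected


def score_fields(predictions: list[dict], golden: list[dict]) -> dict[str, dict[str, float]]:
    """Per-field precision and recall over documents paired by position."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for pred, gold in zip(predictions, golden):
        for field in set(pred) | set(gold):
            p, g = pred.get(field), gold.get(field)
            c = counts[field]
            if p is not None and g is not None and field_matches(field, p, g):
                c["tp"] += 1
                continue
            if p is not None:
                c["fp"] += 1   # a value was emitted, but it is wrong or spurious
            if g is not None:
                c["fn"] += 1   # a value was expected, but it is missing or wrong
    report = {}
    for field, c in counts.items():
        emitted = c["tp"] + c["fp"]
        expected = c["tp"] + c["fn"]
        report[field] = {
            "precision": c["tp"] / emitted if emitted else 0.0,
            "recall": c["tp"] / expected if expected else 0.0,
        }
    return report
```

Running `score_fields` once per provider and per document tier over the same golden set would give the breakdown in the third bullet, and the same per-field counting applies to the cleanup log if each manual correction is recorded with its field name.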
Evaluation is the part most AI demos skip. It is also the part that decides whether a system can be trusted to run unattended. The framework above is what I bring to that question.