Where Not to Call the Model
On a real budget, the highest-leverage decision in an AI system is often not calling the model at all. Cost tracks the work you route to it — so route only what genuinely needs it.
The cheapest call is the one you never make
Most writing about AI cost is about picking a cheaper model or trimming tokens. In the systems I have built, the larger saving came from somewhere else: deciding, before any call, whether the model needs to run at all. Because cost scales with the work you route to the model, the highest-leverage component is often a cheap, deterministic gate that keeps most of that work away from it.
This is not premature optimization. It is the difference between a hobby script and something that can run every day without the bill becoming the reason you turn it off.
Two gates from real systems
Pixel-diff before vision OCR. In a handwriting pipeline, annotated tablet pages have to be read back by a vision model, which is the expensive step. But on a given day only a few pages of a twelve-page document are actually written on. So before anything reaches the model, the system pixel-diffs the clean and annotated PDFs and sends only the changed pages, typically one to three of twelve. Cost tracks edits, not document size.
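A minimal sketch of that kind of gate, not the pipeline's actual code. It assumes each page has already been rasterized to a grayscale byte buffer of the same size in both versions, and the names `changed_fraction`, `pages_to_send`, `tolerance`, and `min_fraction` are hypothetical, as are their default values.

```python
from typing import List, Sequence


def changed_fraction(clean: bytes, annotated: bytes, tolerance: int = 16) -> float:
    """Fraction of pixels whose grayscale value differs by more than `tolerance`."""
    if len(clean) != len(annotated):
        # Different raster sizes: treat the page as changed rather than guess.
        return 1.0
    if not clean:
        return 0.0
    differing = sum(1 for a, b in zip(clean, annotated) if abs(a - b) > tolerance)
    return differing / len(clean)


def pages_to_send(
    clean_pages: Sequence[bytes],
    annotated_pages: Sequence[bytes],
    tolerance: int = 16,          # assumed value: ignores small rendering noise
    min_fraction: float = 0.001,  # assumed value: ignores a few stray pixels
) -> List[int]:
    """Return the zero-based indices of pages worth sending to the vision model."""
    if len(clean_pages) != len(annotated_pages):
        raise ValueError("clean and annotated documents must have the same page count")
    return [
        index
        for index, (clean, annotated) in enumerate(zip(clean_pages, annotated_pages))
        if changed_fraction(clean, annotated, tolerance) > min_fraction
    ]
```

Both knobs are the kind of threshold that has to be tuned per document set: too strict and rendering noise triggers calls, too loose and faint pencil marks are skipped.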
A cost-aware tier router. In a document-extraction pipeline, a free local parser (pdfplumber) handles every text-readable PDF, and only the scans that defeat it are escalated to a paid, document-native model. A third tier holds legacy data from a one-time migration. Each document is routed to the cheapest tool that can actually read it, so the paid model only ever sees the documents that genuinely need it.
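The first two tiers of such a router could be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the "text-readable" test here is a simple character count, `min_chars=200` is an arbitrary placeholder, `paid_extract` stands in for whatever paid model is used, and the legacy tier is left out. The only real library calls are `pdfplumber.open` and `page.extract_text`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Routed:
    tier: str  # "local" or "paid"
    text: str


def extract_with_pdfplumber(path: str) -> str:
    """Free local tier: pull the embedded text layer, if there is one."""
    import pdfplumber  # third-party; imported lazily so the router has no hard dependency

    with pdfplumber.open(path) as pdf:
        return "\n".join((page.extract_text() or "") for page in pdf.pages)


def is_text_readable(text: str, min_chars: int = 200) -> bool:
    """Hypothetical gate: enough non-whitespace characters came back."""
    return len("".join(text.split())) >= min_chars


def route(
    path: str,
    local_extract: Callable[[str], str],
    paid_extract: Callable[[str], str],  # hypothetical wrapper around the paid model
    min_chars: int = 200,
) -> Routed:
    """Try the free parser first; escalate only when it cannot read the file."""
    try:
        text = local_extract(path)
    except Exception:
        # Sketch-level handling: any local failure counts as "cheap path cannot read this".
        text = ""
    if is_text_readable(text, min_chars):
        return Routed("local", text)
    return Routed("paid", paid_extract(path))
```

Passing the extractors in as callables keeps the gate itself free to test: `route(path, extract_with_pdfplumber, my_paid_client)` in production, fakes in a unit test.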
Most of AI engineering on a budget is knowing where not to call the model.
The catch: the gate has to be cheap and right
A router is only a saving if the gate itself is cheap and reliable. If the test for "can the cheap path handle this?" is wrong, you have made things worse in one of two ways: you overspend by escalating documents that did not need it, or you send the model garbage the cheap path should have rejected. Both gates above work precisely because they are cheap and deterministic — a pixel comparison and a text-readable check, not another model judging whether to call the first model.
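One way to check that a gate is "right" is to run it over a small hand-labeled sample and count its mistakes in each direction. The sketch below is an assumption about how such a check might look, not something from the systems described; `GateAudit` and `audit_gate` are hypothetical names, and the second error type (an item kept on the cheap path that needed the expensive one) is my framing of the gate being wrong in the other direction.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class GateAudit:
    total: int = 0
    needless_escalations: int = 0  # gate escalated, but the cheap path was enough
    missed_escalations: int = 0    # gate kept it cheap, but it needed the expensive path


def audit_gate(
    gate: Callable[[str], bool],
    labeled: Iterable[Tuple[str, bool]],
) -> GateAudit:
    """Compare gate decisions with hand-checked labels.

    `gate(item)` returns True when the item should be escalated.
    Each labeled pair is (item, needs_expensive_path).
    """
    result = GateAudit()
    for item, needs_expensive in labeled:
        result.total += 1
        escalated = gate(item)
        if escalated and not needs_expensive:
            result.needless_escalations += 1
        elif not escalated and needs_expensive:
            result.missed_escalations += 1
    return result
```

Needless escalations show up on the bill; missed ones show up as bad output, so it is worth looking at both counts before trusting a gate.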
The honest limit: the thresholds behind these gates are tuned to the documents I was handling, not to anything universal. The principle transfers cleanly: find the cheap, reliable signal that says "the expensive path is unnecessary here." The specific gate, though, has to be rebuilt for each problem. That rebuild is usually an afternoon of work that pays for itself in the first month of running.