
Case study

Document Pipeline

Document intelligence for a small business — intake and triage, then high-precision financial extraction at production scale.

Year: since 2023
Status: live
Role: Architect & Implementer
Stack: Python, Tesseract, Pydantic, Gemini, GPT-4o, Landing AI, SQLite, Notion API

Incoming paper — mail, supplier invoices, daily POS and card-settlement reports — was processed and typed into spreadsheets by hand, so sales were only worked through twice a year. This pipeline triages every document with an LLM and reads roughly fifteen business documents a day into a typed, reconciled database — turning a twice-yearly look at the books into article-level performance visible within days.

At a glance

Incoming paper flows through one AI extraction engine into reconciled, bookkeeping-ready data.

Before → after

Daily business documents
Before: typed into spreadsheets by hand
After: read automatically, ~15/day

Sales insight
Before: reviewed every ~6 months
After: visible within days

Data-entry labour
Before: ~€5,000 / year (loaded)
After: ~€0.14 per paid extraction

~3,500 invoices processed
~15,700 line items
~15/day business documents
€0.14 per paid extraction
6 months → days: sales-data visibility

The technical breakdown

The business problem

A small business and the household behind it generate two streams of paper. The first is incoming post — invoices, official letters, dunning notices, tax mail — that has to be opened, understood, classified, filed, and acted on before a deadline passes. By hand that ran 5–10 hours a month, and the financial documents were the unforgiving part: a missed invoice is a missed due date.

The second stream is heavier and operational: the daily business documents — point-of-sale daily-close reports, article-level sales reports, card-settlement summaries, supplier invoices. About fifteen of these land every day. Each one used to be typed into spreadsheets by hand, line by line, and that cost was high enough that the numbers were only worked through every six months. New articles went in on instinct — two quarters could pass before the figures showed whether one was a top seller or dead stock. For a small business that lag is expensive: it is capital sitting in the wrong inventory, found out months too late.

A figure you mistype is an error you find months later. A sales report you never type is a decision you make blind.

The reading is fuzzy and human; the classifying, filing, and routing are rules you can state exactly. The system is built on that seam — a wide triage front door for the mail, and a high-precision extraction engine for the daily business documents behind it.

The architecture

The pipeline runs in two stages over one shared filesystem. Stage 1 is the letter-opener: it opens, reads, and triages incoming post. Stage 2 — Landing AI, the pre-stage to the books — turns the daily business documents into accounting-grade, reconciled data.

Two-stage flow. Stage 1 opens and triages every incoming PDF and files household mail to Notion; business documents hand off to Stage 2 (Landing AI), which classifies into typed schemas, routes by cost tier (free local parsing vs. the paid document-native model), and reconciles into bookkeeping-ready data.

Stage 1 — the letter-opener. A watched folder; each PDF is keyed by content hash (the cache key for both OCR and extraction), OCR'd with Tesseract, then read by an LLM into a ~24-field Pydantic schema behind a multi-backend provider abstraction that runs one primary with one automatic fallback. Deterministic Python — never the model — decides which context a document belongs to and whether it is business-relevant. Household and other paperwork is filed under a recipient/year/category/ hierarchy and synced to Notion. Stage 1 is mostly mail, but scanning is human: a business document occasionally lands in the wrong intake folder. The same content classification catches it and reroutes it to Stage 2 rather than mis-filing it, so a misplaced scan does not become a missing entry in the books.
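The content-hash keying described above can be sketched as follows. This is a minimal illustration, not the pipeline's code: SHA-256 is an assumed choice of hash, and `HashCache` and `fake_ocr` are hypothetical stand-ins for the real cache and the Tesseract call.

```python
from __future__ import annotations

import hashlib
from typing import Callable


def content_key(pdf_bytes: bytes) -> str:
    """Hash of the file bytes, so the key ignores filename and folder."""
    return hashlib.sha256(pdf_bytes).hexdigest()  # SHA-256 is an assumption


class HashCache:
    """In-memory stand-in (hypothetical) for the OCR / extraction cache."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.misses = 0

    def get_or_compute(self, pdf_bytes: bytes, compute: Callable[[bytes], str]) -> str:
        key = content_key(pdf_bytes)
        if key not in self._store:
            # Only unseen content pays for the expensive step.
            self.misses += 1
            self._store[key] = compute(pdf_bytes)
        return self._store[key]


def fake_ocr(pdf_bytes: bytes) -> str:
    # Placeholder for the real OCR step; it just decodes and upper-cases.
    return pdf_bytes.decode("utf-8", errors="replace").upper()


cache = HashCache()
first = cache.get_or_compute(b"rechnung 4711", fake_ocr)
second = cache.get_or_compute(b"rechnung 4711", fake_ocr)  # same bytes: cache hit
```

Because the key comes from the bytes rather than the path, a document that is rescanned into a different folder with identical content would not be OCR'd or extracted twice.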

Stage 2 — the accounting pre-stage. This is where the roughly fifteen business documents a day become bookkeeping data. Each is first sorted by a keyword-scoring classifier into one of four document types — invoice, daily-close, article report, card-settlement — each with its own typed schema. A cost-aware three-tier router then decides how to read it, and results are stored zlib-compressed in SQLite, archived by year/month/type.
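A keyword-scoring classifier of the kind described can be sketched in a few lines. The keyword lists below are invented for illustration, since the text does not give the real vocabulary, and `classify` is a hypothetical name.

```python
from __future__ import annotations

import re
from collections import Counter
from typing import Optional

# Hypothetical keyword lists, one per document type named in the text.
KEYWORDS = {
    "invoice": ["rechnung", "rechnungsnummer", "zahlungsziel"],
    "daily_close": ["tagesabschluss", "z-bericht", "kassenabschluss"],
    "article_report": ["artikelbericht", "artikelumsatz", "warengruppe"],
    "card_settlement": ["kartenabrechnung", "terminal-id", "transaktionen"],
}


def classify(text: str) -> tuple[Optional[str], dict[str, int]]:
    """Score each type by whole-token keyword hits; return (best type, scores)."""
    # Whole tokens, so "kartenabrechnung" does not also count as "rechnung".
    tokens = Counter(re.findall(r"[\w-]+", text.lower()))
    scores = {
        doc_type: sum(tokens[kw] for kw in kws)
        for doc_type, kws in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)  # ties resolve to the first type listed
    return (best if scores[best] > 0 else None), scores


doc_type, scores = classify("Kartenabrechnung Terminal-ID 5521 Transaktionen 42")
```

A score of zero for every type returns `None` rather than a guess, which leaves room for a manual-review path.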

| Tier | Tool | When it's used |
| --- | --- | --- |
| Local | pdfplumber | text-readable PDFs (free) |
| Document-native | Landing AI ADE | scans that defeat pdfplumber (paid) |
| Legacy | retired OCR service | historical data, migrated in once |

Route each document to the cheapest tool that can actually read it — and pay for the expensive one only when it earns its keep.
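The routing rule can be sketched as "try the free reader first and escalate only when it comes back empty-handed". In the sketch below the two extractors are injected as plain callables so that it runs without pdfplumber or a Landing AI key. The `min_chars` threshold and all names are assumptions, and the legacy tier is left out because it was a one-off migration rather than a live route.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

MIN_CHARS = 40  # hypothetical: below this, treat the PDF as an image-only scan


@dataclass(frozen=True)
class Routed:
    tier: str
    text: str
    paid: bool


def route(
    pdf_bytes: bytes,
    local_extract: Callable[[bytes], str],
    paid_extract: Callable[[bytes], str],
    min_chars: int = MIN_CHARS,
) -> Routed:
    """Use the free local reader when it yields enough text, else the paid one."""
    local_text = local_extract(pdf_bytes).strip()
    if len(local_text) >= min_chars:
        return Routed(tier="local", text=local_text, paid=False)
    # The paid call happens only on this branch.
    return Routed(tier="document-native", text=paid_extract(pdf_bytes), paid=True)


# Stub extractors for the demo; real ones would wrap pdfplumber and the paid API.
def stub_local(pdf_bytes: bytes) -> str:
    return pdf_bytes.decode("utf-8") if pdf_bytes.startswith(b"TEXT:") else ""


def stub_paid(pdf_bytes: bytes) -> str:
    return "text recovered from scan"


digital = route(b"TEXT:" + b"Rechnung Nr. 118, Summe 119,00 EUR, netto 100,00", stub_local, stub_paid)
scanned = route(b"\x89binary-scan", stub_local, stub_paid)
```

A character count is a crude readability test. A production router would probably also check that the expected fields can actually be parsed from the local text before skipping the paid tier.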

Reconciliation. Extracted line items are matched against a supplier-article catalog (in Supabase), so the same product reconciles across vendors whatever a given invoice calls it — turning raw extractions into comparable, queryable data rather than disconnected rows. That reconciliation is what makes article-level performance answerable on demand instead of once every six months.
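One way to picture the reconciliation step is an alias table that maps each supplier's label for a product onto a single internal article id. The catalog entries, ids, and function names below are invented for illustration; the real catalog's structure is not described in the text.

```python
from __future__ import annotations

# Hypothetical alias catalog: (supplier, supplier's own label) -> internal article id.
CATALOG = {
    ("supplier_a", "apfelsaft 1l"): "ART-001",
    ("supplier_b", "apfelsaft naturtrüb 1,0 l"): "ART-001",
    ("supplier_a", "mineralwasser 0,5l"): "ART-002",
}


def normalise(label: str) -> str:
    """Lower-case and collapse whitespace so trivial variants still match."""
    return " ".join(label.lower().split())


def reconcile(items: list[dict]) -> tuple[dict[str, int], list[dict]]:
    """Sum quantities per internal article id; collect rows with no catalog match."""
    totals: dict[str, int] = {}
    unmatched: list[dict] = []
    for item in items:
        article_id = CATALOG.get((item["supplier"], normalise(item["label"])))
        if article_id is None:
            unmatched.append(item)  # surfaced for review, never silently dropped
            continue
        totals[article_id] = totals.get(article_id, 0) + item["qty"]
    return totals, unmatched


line_items = [
    {"supplier": "supplier_a", "label": "Apfelsaft  1L", "qty": 24},
    {"supplier": "supplier_b", "label": "Apfelsaft naturtrüb 1,0 l", "qty": 12},
    {"supplier": "supplier_b", "label": "Unbekannter Artikel", "qty": 3},
]
totals, unmatched = reconcile(line_items)
```

The two differently labelled apple-juice rows land on the same article id, which is what makes per-article totals comparable across vendors.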

Decisions & trade-offs

LLMs extract facts; deterministic code makes decisions. The model reads; rule-based Python classifies by business context and routes. Asking the model which entity an invoice belongs to would be asking it to guess, so it never does.
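The split can be illustrated with a pure function over already-extracted fields. Everything here is hypothetical: the field names (`recipient`, `category`, `date`), the rule tables, and the destination labels are placeholders for whatever the real rules look like.

```python
from __future__ import annotations

# Hypothetical rule tables; the pipeline's actual rules are not given in the text.
BUSINESS_RECIPIENTS = {"example shop gmbh"}
BUSINESS_CATEGORIES = {"invoice", "daily_close", "article_report", "card_settlement"}


def decide(fields: dict) -> dict:
    """Pure rules over extracted facts; no model call happens in here."""
    recipient = (fields.get("recipient") or "").strip().lower()
    category = fields.get("category") or "other"
    year = str(fields.get("date") or "")[:4] or "unknown"

    context = "business" if recipient in BUSINESS_RECIPIENTS else "household"
    if context == "business" and category in BUSINESS_CATEGORIES:
        # Business documents are handed to Stage 2 instead of being filed.
        return {"context": context, "destination": "stage2", "path": None}
    # Everything else is filed under recipient/year/category.
    path = f"{recipient or 'unknown'}/{year}/{category}"
    return {"context": context, "destination": "notion", "path": path}


business = decide({"recipient": "Example Shop GmbH", "category": "invoice", "date": "2025-03-01"})
household = decide({"recipient": "Jane Doe", "category": "tax", "date": "2024-11-05"})
```

Because `decide` is deterministic, the same extracted fields always produce the same routing, and a wrong decision can be traced to a rule rather than to a model's mood.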

Fallback on validation failure, not just on API errors. A provider can return a confident, well-formed answer that still doesn't match the schema — the dangerous case, because a plausible-but-wrong extraction sails into the books. So a result is validated before it's accepted, and a validation failure falls through to another provider. The trade-off is a second call's latency and cost, bought for guaranteed structure.
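A minimal sketch of that policy follows. To keep it runnable on the standard library alone, a hand-written `validate` function stands in for the Pydantic schema, and the three-field `Invoice` is a toy version of the real one. The provider functions are stubs, and all names are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass(frozen=True)
class Invoice:
    vendor: str
    invoice_date: date
    total_cents: int


def validate(raw: dict) -> Invoice:
    """Stand-in for schema validation: raise ValueError on any mismatch."""
    try:
        vendor = str(raw["vendor"]).strip()
        invoice_date = date.fromisoformat(raw["invoice_date"])
        total_cents = int(raw["total_cents"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"schema mismatch: {exc!r}") from exc
    if not vendor or total_cents < 0:
        raise ValueError("schema mismatch: empty vendor or negative total")
    return Invoice(vendor, invoice_date, total_cents)


def extract_with_fallback(
    text: str, providers: list[tuple[str, Callable[[str], dict]]]
) -> tuple[str, Invoice]:
    """Try providers in order; API errors and validation failures both fall through."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, validate(call(text))
        except Exception as exc:  # network error or schema mismatch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def primary(text: str) -> dict:
    # Well-formed and confident, but the date is not in the schema's ISO format.
    return {"vendor": "Muster GmbH", "invoice_date": "31.12.2024", "total_cents": 11900}


def fallback(text: str) -> dict:
    return {"vendor": "Muster GmbH", "invoice_date": "2024-12-31", "total_cents": 11900}


used, invoice = extract_with_fallback("...", [("primary", primary), ("fallback", fallback)])
```

The primary stub never raises an API error, yet its answer is rejected and the fallback's is accepted. That is the case a fallback triggered only by API errors would miss.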

The provider migration was a decision, not a default. What decided it wasn't headline price but accuracy-per-cleanup: a cheaper extraction that needs an afternoon of data-cleaning isn't cheaper.

| Provider | Role | What decided it |
| --- | --- | --- |
| Dedicated OCR service | baseline (retired) | raw labelled boxes, no schema fit → heavy manual cleanup |
| gpt-4o | cost probe | clean JSON on the first pass, but fumbles low-quality scans |
| Landing AI ADE | production | accurate, no formatting cleanup needed; pay-per-page |

What broke

The migration benchmark was meant to produce a clean accuracy table; it produced the answer by failing instead. The dedicated OCR service returned raw labelled boxes that didn't map to the target schema at all — every line item came back unmatched without a custom normalization layer on top. In production that meant a backfill re-extracting 10,842 line items from ~1,800 documents the old service had captured as text but never structured, plus ~70 merchant-name spelling variants to reconcile. The general LLM was usable but not flawless on bad scans — a date read a year wrong, a vendor mangled into OCR noise. And the intake schema guarantee is honest-but-imperfect: validation drives the fallback at the read boundary, but the dict finally written passes through a lighter coercion layer, so the guarantee is weaker at the write boundary.
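Collapsing merchant-name spelling variants might start with a canonicalisation step like the one below. It is a sketch under stated assumptions: the suffix list is hypothetical and deliberately short, and the approach handles only case, punctuation, diacritics, and legal-form suffixes. Genuine OCR noise would still need a reviewed alias list or fuzzy matching on top.

```python
from __future__ import annotations

import re
import unicodedata
from collections import defaultdict

# Hypothetical list of legal-form tokens to ignore when comparing names.
LEGAL_SUFFIXES = {"gmbh", "mbh", "kg", "ohg", "ug", "co"}


def canonical_merchant(name: str) -> str:
    """Case-fold, strip diacritics and punctuation, drop legal-form tokens."""
    folded = unicodedata.normalize("NFKD", name.casefold())
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", folded)
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def group_variants(names: list[str]) -> dict[str, list[str]]:
    """Group raw spellings under their canonical form."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        groups[canonical_merchant(name)].append(name)
    return dict(groups)


variants = [
    "Getränke Schmidt GmbH",
    "GETRÄNKE SCHMIDT",
    "Getranke-Schmidt GmbH & Co. KG",
    "Bürobedarf Weber",
]
groups = group_variants(variants)
```

Three of the four spellings collapse to one key, while the unrelated merchant stays separate.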

The outcome

In use since 2023 across versions, with the move onto Landing AI made in late 2025. The extraction engine runs across ~3,500 invoices and ~15,700 line items, most parsed free locally and only the hard few hundred sent to the paid model at roughly €0.14 each.

Two burdens lifted. The mail intake replaced 5–10 hours a month of opening, sorting, and filing with structured, deadline-tracked records. The heavier one is the daily business documents: about fifteen a day that previously had to be typed into spreadsheets by hand. Checked against real examples (a daily-close report, an article-sales report, a multi-line wholesale invoice), three to five minutes of careful entry per document is, if anything, a conservative estimate, since a long invoice whose positions have to be matched against the article catalog takes longer. Fifteen a day comes to roughly an hour daily, or five to six hours a week. Valued at Germany's statutory minimum wage plus roughly 30% employer on-costs (about €18 an hour), that is on the order of €400–470 a month, or close to €5,000 a year, of pure data entry. Offshoring is cheaper per hour, but the documents are German and every extracted figure would have to be re-validated by a German speaker, so the validation overhead eats most of the saving. The pipeline removes the task rather than relocating it.
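The labour estimate can be reproduced as a short calculation. The inputs are the figures quoted above; the 52-week year and the even split into twelve months are assumptions made here to turn weekly hours into monthly and annual costs.

```python
# Inputs taken from the text.
MINUTES_PER_DOC = (3, 5)   # careful manual entry, low and high estimate
DOCS_PER_DAY = 15
HOURS_PER_WEEK = (5, 6)    # the text's weekly figure
HOURLY_COST_EUR = 18       # minimum wage plus ~30% employer on-costs

# Assumption made for this sketch.
WEEKS_PER_YEAR = 52

# 15 documents at 3 to 5 minutes each: 45 to 75 minutes, about an hour a day.
daily_minutes = tuple(m * DOCS_PER_DAY for m in MINUTES_PER_DOC)

# Weekly hours priced at the loaded hourly cost, scaled to a year and a month.
annual_eur = tuple(h * HOURLY_COST_EUR * WEEKS_PER_YEAR for h in HOURS_PER_WEEK)
monthly_eur = tuple(round(a / 12) for a in annual_eur)

print(daily_minutes)  # (45, 75)
print(monthly_eur)    # (390, 468)
print(annual_eur)     # (4680, 5616)
```

The range of about €390–470 a month and €4,700–5,600 a year brackets the quoted figure of close to €5,000 a year.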

The larger win is not the hours, though — it is the decision they were blocking. Sales that were worked through every six months because the manual entry was prohibitive are now reconciled as the documents process, so whether a newly introduced article is a top seller or dead stock is visible within days instead of two quarters later. For a small business deciding what to stock, that is the difference between steering and guessing.

The honest gaps: extraction accuracy is high in practice but unmeasured (no evaluation layer yet), and the whole thing runs by hand on one machine rather than on a schedule with observability — the next steps if it ever has to leave the laptop.