Document Pipeline
Document intelligence for a small business — intake and triage, then high-precision financial extraction at production scale.
- Period: since 2023
- Status: live
- Role: Architect & Implementer
- Stack: Python, Tesseract, Pydantic, Gemini, GPT-4o, Landing AI, SQLite, Notion API
Incoming paper — mail, supplier invoices, daily POS and card-settlement reports — was processed and typed into spreadsheets by hand, so sales were only worked through twice a year. This pipeline triages every document with an LLM and reads roughly fifteen business documents a day into a typed, reconciled database — turning a twice-yearly look at the books into article-level performance visible within days.
- typed into spreadsheets by hand → read automatically, ~15/day
- reviewed every ~6 months → visible within days
- ~€5,000 / year (loaded) → ~€0.14 per paid extraction
~3,500 invoices
~15,700 line items
~15 documents/day
€0.14 per paid extraction
6 months → days
The technical breakdown
The business problem
A small business and the household behind it generate two streams of paper. The first is incoming post — invoices, official letters, dunning notices, tax mail — that has to be opened, understood, classified, filed, and acted on before a deadline passes. By hand that ran 5–10 hours a month, and the financial documents were the unforgiving part: a missed invoice is a missed due date.
The second stream is heavier and operational: the daily business documents — point-of-sale daily-close reports, article-level sales reports, card-settlement summaries, supplier invoices. About fifteen of these land every day. Each one used to be typed into spreadsheets by hand, line by line, and that cost was high enough that the numbers were only worked through every six months. New articles went in on instinct — two quarters could pass before the figures showed whether one was a top seller or dead stock. For a small business that lag is expensive: it is capital sitting in the wrong inventory, found out months too late.
A figure you mistype is an error you find months later. A sales report you never type is a decision you make blind.
The reading is fuzzy and human; the classifying, filing, and routing are rules you can state exactly. The system is built on that seam — a wide triage front door for the mail, and a high-precision extraction engine for the daily business documents behind it.
The architecture
The pipeline runs in two stages over one shared filesystem. Stage 1 is the letter-opener: it opens, reads, and triages incoming post. Stage 2 is the accounting pre-stage: it turns the daily business documents into accounting-grade, reconciled data, using Landing AI for the scans that cannot be read locally.
Stage 1 — the letter-opener. A watched folder; each PDF is keyed by content hash (the cache key for both OCR and extraction), OCR'd with Tesseract, then read by an LLM into a ~24-field Pydantic schema behind a multi-backend provider abstraction that runs one primary with one automatic fallback. Deterministic Python — never the model — decides which context a document belongs to and whether it is business-relevant. Household and other paperwork is filed under a recipient/year/category/ hierarchy and synced to Notion. Stage 1 is mostly mail, but scanning is human: a business document occasionally lands in the wrong intake folder. The same content classification catches it and reroutes it to Stage 2 rather than mis-filing it, so a misplaced scan does not become a missing entry in the books.
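The content-hash keying described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the choice of SHA-256, the function names `content_key` and `cached_or_compute`, and the in-memory dict standing in for the cache are all assumptions.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict


def content_key(pdf_path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a hex digest of the file's bytes.

    SHA-256 is an assumption; the text only says "content hash".
    The key depends on the bytes alone, so a renamed or re-dropped
    copy of the same PDF maps to the same cache entry.
    """
    digest = hashlib.sha256()
    with pdf_path.open("rb") as fh:
        # Read in chunks so large scans are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_or_compute(key: str, cache: Dict[str, str], compute: Callable[[], str]) -> str:
    """Return the cached value for key, calling compute() only on a miss."""
    if key not in cache:
        cache[key] = compute()
    return cache[key]
```

Because one key serves both OCR and extraction, a document that has already been read costs neither a second Tesseract pass nor a second LLM call.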
Stage 2 — the accounting pre-stage. This is where the roughly fifteen business documents a day become bookkeeping data. Each is first sorted by a keyword-scoring classifier into one of four document types — invoice, daily-close, article report, card-settlement — each with its own typed schema. A cost-aware three-tier router then decides how to read it, and results are stored zlib-compressed in SQLite, archived by year/month/type.
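A keyword-scoring classifier and zlib-compressed SQLite storage of this kind might look like the sketch below. The keyword lists, the table layout, and every function name are hypothetical; the text does not give the real vocabulary or schema.

```python
import json
import re
import sqlite3
import zlib
from typing import Optional

# Hypothetical German keywords per document type (not the real lists).
KEYWORDS = {
    "invoice": ("rechnung", "rechnungsnummer", "zahlungsziel"),
    "daily_close": ("tagesabschluss", "z-bericht", "kassenabschluss"),
    "article_report": ("artikelbericht", "artikelumsatz", "warengruppe"),
    "card_settlement": ("kartenabrechnung", "terminal", "girocard"),
}


def classify(text: str) -> Optional[str]:
    """Score each type by whole-word keyword hits; None if nothing matches."""
    # Whole tokens, so "kartenabrechnung" does not count as "rechnung".
    tokens = re.findall(r"[a-zäöüß-]+", text.lower())
    scores = {
        doc_type: sum(tokens.count(word) for word in words)
        for doc_type, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)  # ties go to the first type listed
    return best if scores[best] > 0 else None


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extractions ("
        "doc_key TEXT PRIMARY KEY, doc_type TEXT NOT NULL, payload BLOB NOT NULL)"
    )
    return conn


def save(conn: sqlite3.Connection, doc_key: str, doc_type: str, result: dict) -> None:
    """Store the extraction result as zlib-compressed JSON."""
    payload = zlib.compress(json.dumps(result, ensure_ascii=False).encode("utf-8"))
    conn.execute(
        "INSERT OR REPLACE INTO extractions (doc_key, doc_type, payload) VALUES (?, ?, ?)",
        (doc_key, doc_type, payload),
    )
    conn.commit()


def load(conn: sqlite3.Connection, doc_key: str) -> Optional[dict]:
    row = conn.execute(
        "SELECT payload FROM extractions WHERE doc_key = ?", (doc_key,)
    ).fetchone()
    return None if row is None else json.loads(zlib.decompress(row[0]).decode("utf-8"))
```

A document that scores zero on every type returns `None`, which leaves room for a manual-review path rather than forcing a guess.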
| Tier | Tool | When it's used |
|---|---|---|
| Local | pdfplumber | text-readable PDFs — free |
| Document-native | Landing AI ADE | scans that defeat pdfplumber — paid |
| Legacy | retired OCR service | historical data, migrated in once |
Route each document to the cheapest tool that can actually read it — and pay for the expensive one only when it earns its keep.
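The routing rule can be sketched as "try the free reader first, pay only if it comes back empty-handed". The sketch below is an assumption about how such a router could be written: the character threshold, the `Routed` type, and the injected reader callables are hypothetical. In practice `read_local` would wrap pdfplumber and `read_paid` the Landing AI call; both are passed in here so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical threshold: below this many characters the local read is
# treated as a failed extraction (for example, an image-only scan).
MIN_CHARS = 200


@dataclass
class Routed:
    tier: str
    text: str


def route(
    path: str,
    read_local: Callable[[str], Optional[str]],
    read_paid: Callable[[str], str],
    min_chars: int = MIN_CHARS,
) -> Routed:
    """Use the free local reader when it yields enough text, else the paid one."""
    text = read_local(path) or ""
    if len(text.strip()) >= min_chars:
        return Routed("local", text)
    # The paid reader is only invoked when the free tier could not read the file.
    return Routed("document-native", read_paid(path))
```

The legacy tier does not appear because, per the table, it was only used once to migrate historical data.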
Reconciliation. Extracted line items are matched against a supplier-article catalog (in Supabase), so the same product reconciles across vendors whatever a given invoice calls it — turning raw extractions into comparable, queryable data rather than disconnected rows. That reconciliation is what makes article-level performance answerable on demand instead of once every six months.
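One way to picture the reconciliation step is a lookup from (supplier, normalized description) to a shared article id. This is a sketch under stated assumptions: an in-memory dict stands in for the Supabase catalog, and the normalization rule, the field names, and the `ART-001` style ids are invented for illustration.

```python
import re
from typing import Dict, Iterable, List, Tuple

Catalog = Dict[Tuple[str, str], str]


def normalize(name: str) -> str:
    """Casefold and collapse punctuation/whitespace so spelling noise drops out."""
    return re.sub(r"[^\w]+", " ", name.casefold()).strip()


def build_catalog(rows: Iterable[Tuple[str, str, str]]) -> Catalog:
    """rows are (supplier, supplier's article name, internal article id)."""
    return {(supplier, normalize(name)): article_id for supplier, name, article_id in rows}


def reconcile(
    supplier: str, line_items: List[dict], catalog: Catalog
) -> Tuple[List[dict], List[dict]]:
    """Attach an article id to each known line item; return (matched, unmatched)."""
    matched, unmatched = [], []
    for item in line_items:
        article_id = catalog.get((supplier, normalize(item["description"])))
        if article_id is None:
            unmatched.append(item)  # left for manual catalog maintenance
        else:
            matched.append({**item, "article_id": article_id})
    return matched, unmatched
```

For example, two suppliers that list the same product as "Apfelsaft 1L" and "APFEL-SAFT 1 L" can both map to one article id, so their quantities become comparable in a single query.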
Decisions & trade-offs
LLMs extract facts; deterministic code makes decisions. The model reads; rule-based Python classifies by business context and routes. Asking the model which entity an invoice belongs to would be asking it to guess, so it never does.
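The split between extraction and decision can be illustrated with a small rule function that runs on the fields the model has already extracted. Everything specific here is hypothetical: the recipient names, the field names `recipient` and `doc_kind`, and the two context labels are placeholders, not the pipeline's real rules.

```python
# Hypothetical rule inputs; the real entity names and rules are not in the text.
BUSINESS_RECIPIENTS = ("example gmbh", "example shop")
BUSINESS_DOC_KINDS = {"invoice", "daily_close", "article_report", "card_settlement"}


def decide_context(fields: dict) -> str:
    """Decide the context from extracted facts with plain string rules.

    The model supplied `recipient`; it is never asked which entity the
    document belongs to.
    """
    recipient = (fields.get("recipient") or "").casefold()
    if any(name in recipient for name in BUSINESS_RECIPIENTS):
        return "business"
    return "household"


def is_business_relevant(fields: dict) -> bool:
    """True when the document should be rerouted to the Stage 2 intake."""
    return (
        decide_context(fields) == "business"
        and fields.get("doc_kind") in BUSINESS_DOC_KINDS
    )
```

Because the rules are ordinary code, the same input always produces the same routing, and a wrong decision can be traced to a specific rule.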
Fallback on validation failure, not just on API errors. A provider can return a confident, well-formed answer that still doesn't match the schema — the dangerous case, because a plausible-but-wrong extraction sails into the books. So a result is validated before it's accepted, and a validation failure falls through to another provider. The trade-off is a second call's latency and cost, bought for guaranteed structure.
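The fallback rule can be sketched as a loop over two providers in which a validation failure is treated exactly like a transport error. To stay dependency-free, a hand-written `validate` function stands in for the Pydantic schema; the three fields, the exception name, and the provider callables are assumptions for illustration.

```python
from typing import Callable, Dict

Provider = Callable[[str], Dict]


class ValidationFailed(Exception):
    """Raised when a provider's answer does not fit the expected structure."""


def validate(raw: Dict) -> Dict:
    """Stand-in for schema validation: check a few required fields and types."""
    required = {"vendor": str, "total": (int, float), "date": str}
    for name, expected in required.items():
        if not isinstance(raw.get(name), expected):
            raise ValidationFailed(f"field {name!r} missing or of the wrong type")
    return raw


def extract_with_fallback(text: str, primary: Provider, fallback: Provider) -> Dict:
    """Accept the first provider whose answer validates; try the fallback once."""
    last_error: Exception = RuntimeError("no provider was tried")
    for provider in (primary, fallback):
        try:
            return validate(provider(text))
        except (ValidationFailed, ConnectionError) as exc:
            # A well-formed but schema-violating answer falls through
            # in the same way as an API failure.
            last_error = exc
    raise RuntimeError("both providers failed") from last_error
```

A primary that returns `"total": "12,50"` as a string would be rejected here even though the call itself succeeded, which is the case the paragraph above calls dangerous.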
The provider migration was a decision, not a default. What decided it wasn't headline price but accuracy-per-cleanup: a cheaper extraction that needs an afternoon of data-cleaning isn't cheaper.
| Provider | Role | What decided it |
|---|---|---|
| Dedicated OCR service | baseline (retired) | raw labelled boxes, no schema fit → heavy manual cleanup |
| GPT-4o | cost probe | clean JSON on the first pass, but fumbles low-quality scans |
| Landing AI ADE | production | accurate, no formatting cleanup needed; pay-per-page |
What broke
The migration benchmark was meant to produce a clean accuracy table; it produced the answer by failing instead. The dedicated OCR service returned raw labelled boxes that didn't map to the target schema at all — every line item came back unmatched without a custom normalization layer on top. In production that meant a backfill re-extracting 10,842 line items from ~1,800 documents the old service had captured as text but never structured, plus ~70 merchant-name spelling variants to reconcile. The general LLM was usable but not flawless on bad scans — a date read a year wrong, a vendor mangled into OCR noise. And the intake schema guarantee is honest-but-imperfect: validation drives the fallback at the read boundary, but the dict finally written passes through a lighter coercion layer, so the guarantee is weaker at the write boundary.
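Reconciling spelling variants of merchant names is the kind of task fuzzy string matching can help with. The text does not say how the ~70 variants were actually resolved, so the sketch below is only one possible approach, using the standard library's `difflib`; the cutoff value, the function name, and the sample merchant names are hypothetical.

```python
import difflib
from typing import Optional, Sequence


def canonical_merchant(
    raw: str, canonical_names: Sequence[str], cutoff: float = 0.8
) -> Optional[str]:
    """Map a spelling variant to its closest canonical name, or None.

    Comparison is done on casefolded strings; the cutoff of 0.8 is an
    assumed value that would need tuning against real data.
    """
    lookup = {name.casefold(): name for name in canonical_names}
    hits = difflib.get_close_matches(
        raw.casefold().strip(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None
```

Names that match nothing return `None`, so a new merchant is surfaced for review instead of being silently merged into a similar-looking one.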
The outcome
In use since 2023 across versions, with the move onto Landing AI made in late 2025. The extraction engine runs across ~3,500 invoices and ~15,700 line items, most parsed free locally and only the hard few hundred sent to the paid model at roughly €0.14 each.
Two burdens lifted. The mail intake replaced 5–10 hours a month of opening, sorting, and filing with structured, deadline-tracked records. The heavier one is the daily business documents: about fifteen a day that previously had to be typed into spreadsheets by hand. Checked against real examples — a daily-close report, an article-sales report, a multi-line wholesale invoice — three to five minutes of careful entry each is if anything conservative, since a long invoice whose positions have to be matched against the article catalog runs longer. Fifteen a day is on the order of an hour daily, five to six hours a week. Valued at Germany's statutory minimum wage plus the roughly 30% employer on-costs — about €18 an hour — that is on the order of €400–470 a month, near €5,000 a year, of pure data entry. Offshoring it is cheaper per hour, but the documents are German and every extracted figure has to be re-validated by a German speaker, so the validation overhead eats most of the saving. The pipeline removes the task rather than relocating it.
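The arithmetic behind that estimate can be reproduced in a few lines. The inputs are the figures from the paragraph above; the 52/12 weeks-per-month conversion is an assumption, which is why the result lands close to, rather than exactly on, the rounded range quoted.

```python
DOCS_PER_DAY = 15
MINUTES_PER_DOC = (3, 5)    # low and high estimate from the text
HOURS_PER_WEEK = (5, 6)     # the text's own weekly figure
HOURLY_COST_EUR = 18        # minimum wage plus roughly 30% on-costs
WEEKS_PER_YEAR = 52         # assumption used for the conversion


def yearly_cost(hours_per_week: float) -> float:
    return hours_per_week * HOURLY_COST_EUR * WEEKS_PER_YEAR


def monthly_cost(hours_per_week: float) -> float:
    return yearly_cost(hours_per_week) / 12


# 15 documents at 3 to 5 minutes each: 45 to 75 minutes a day.
daily_minutes = tuple(DOCS_PER_DAY * m for m in MINUTES_PER_DOC)

monthly_range = tuple(monthly_cost(h) for h in HOURS_PER_WEEK)  # (390.0, 468.0)
yearly_range = tuple(yearly_cost(h) for h in HOURS_PER_WEEK)    # (4680, 5616)
```

The yearly range of roughly €4,700 to €5,600 brackets the "near €5,000" figure.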
The larger win is not the hours, though — it is the decision they were blocking. Sales that were worked through every six months because the manual entry was prohibitive are now reconciled as the documents process, so whether a newly introduced article is a top seller or dead stock is visible within days instead of two quarters later. For a small business deciding what to stock, that is the difference between steering and guessing.
The honest gaps: extraction accuracy is high in practice but unmeasured (no evaluation layer yet), and the whole thing runs by hand on one machine rather than on a schedule with observability — the next steps if it ever has to leave the laptop.