Lessons from systems in production.
Evergreen notes, not a feed. Each piece is drawn from a system I actually built and shipped, and links back to the case study as its worked example. The throughline: production AI needs the reliability rigor that safety-critical engineering brings.
How I Evaluate AI Systems
A model's headline benchmark is not its value. What matters is whether its output survives contact with the rest of the system: whether it holds the structure downstream code expects, what it costs to produce, and how much cleanup it forces downstream.
Where Not to Call the Model
On a real budget, the highest-leverage decision in an AI system is often not calling the model at all. Cost tracks the work you route to it — so route only what genuinely needs it.
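As a rough illustration of that routing idea (a minimal sketch, not the implementation from the case study), the snippet below tries a cheap deterministic rule first and falls through to the model only when the rule cannot answer. The names `extract_iso_date`, `route`, and `call_model` are hypothetical, and the date-extraction task is an assumed stand-in for whatever work a real system would handle without a model.

```python
import re
from typing import Callable, Optional, Tuple

# Assumed example task: pull an ISO-8601 date out of free text.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_iso_date(text: str) -> Optional[str]:
    """Deterministic path: a regex is enough when the date is already ISO-formatted."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None


def route(text: str, call_model: Callable[[str], str]) -> Tuple[str, str]:
    """Return (source, answer); the model is called only when the rule has no answer."""
    cheap = extract_iso_date(text)
    if cheap is not None:
        return ("rule", cheap)
    # Hypothetical model client, injected so the caller controls cost and testing.
    return ("model", call_model(text))
```

Passing `call_model` in as a parameter keeps the model call explicit at the call site, which makes it easy to count how often the expensive path is actually taken.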
Safety-Critical Rigor in Production AI
The discipline that keeps a regulated system safe — deterministic decisions, validation at every boundary, fail-safe defaults, a human in the loop for irreversible actions — is exactly what production LLM systems need.
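One way those four habits might look around a model call is sketched below, under stated assumptions: the action names, the `Decision` type, and the `decide` function are all hypothetical, and the model is assumed to return a JSON object with an `action` field. Parsing is validated at the boundary, anything unexpected falls back to a safe default, the mapping from input to decision is deterministic, and irreversible actions are flagged for human approval rather than executed.

```python
import json
from dataclasses import dataclass

# Hypothetical action vocabulary for illustration only.
ALLOWED_ACTIONS = {"noop", "flag_for_review", "delete_record"}
IRREVERSIBLE = {"delete_record"}


@dataclass(frozen=True)
class Decision:
    action: str
    needs_human: bool


# Fail-safe default: when in doubt, do nothing.
SAFE_DEFAULT = Decision(action="noop", needs_human=False)


def decide(raw_model_output: str) -> Decision:
    """Validate untrusted model output and map it to a bounded decision."""
    try:
        payload = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return SAFE_DEFAULT
    if not isinstance(payload, dict):
        return SAFE_DEFAULT
    action = payload.get("action")
    # Reject anything outside the allowlist, including non-string values.
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return SAFE_DEFAULT
    # Irreversible actions are never executed directly; they wait for a person.
    return Decision(action=action, needs_human=action in IRREVERSIBLE)
```

The same input always yields the same `Decision`, so the behaviour around the model can be unit-tested even though the model itself cannot.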
Keeping Agentic Workflows Debuggable
When an LLM can take actions, the hard part is no longer capability — it is observability and bounded autonomy. (Forthcoming.)