Essay
June 2026 · 8 min read
When the Model Is a Draft, Not the Source of Truth
Designing the boundary where probabilistic output ends and durable product truth begins.
- AI Tooling
- Data Integrity
- Platform Engineering
The Problem
ResearchLog is a system I built to turn everyday engineering activity into R&D tax-credit evidence. It ingests pull requests and issues from GitHub, routes sync and classification jobs through a background queue, and asks Claude to evaluate each activity against the IRS four-part test: permitted purpose, technological in nature, elimination of uncertainty, and process of experimentation. The output is a structured classification with a confidence score, a narrative, and a list of disqualifiers. Eventually that evidence lands in front of a CPA.
That last sentence is the design constraint that shaped everything else. A tax filing is a place where mostly right is a liability. The model's draft is genuinely useful: it reads a PR's title and metadata and produces a plausible, structured judgment in seconds. Plausible and durable are different properties, though, and an audit workflow cannot let one quietly become the other.
The hard part of building this system was not calling the model. The API call is the easy five percent. The hard part was deciding, explicitly and in code, where probabilistic output stops and product truth begins. Most of the architecture, from schema validation and write semantics down to the dashboard math, is an answer to that one question.
The Trap: A Clean-Looking Upsert
The naive pipeline is easy to write and looks correct. An activity syncs in, a job fires, the model classifies it, and the result upserts into a classifications table keyed by activity. Sync again next week and the job refreshes the row with a newer draft. Idempotent and tidy.
The failure shows up the first time a human disagrees with the model. A manager reviews a classification, decides the model got it wrong, and overrides it. Some time later a webhook fires or a backfill runs, the automatic path re-classifies the activity, and the upsert replaces the manager's judgment with a fresh draft. Nothing crashes or logs an error. The dashboard keeps showing plausible numbers. The system has silently destroyed the most valuable data it had: a human decision, which is exactly what an auditor would ask to see.
This is a worse class of bug than a 500, because the loss is invisible and the lost data is irreplaceable. Once you see it, the requirement becomes clear: automatic classification and human review are different kinds of writes, and they cannot share one code path with one set of semantics.
ResearchLog ended up with two deliberately asymmetric write paths. The automatic path yields to humans. If a row is marked user_override = true, the job skips it entirely. The manual path, where a reviewer explicitly asks for a fresh AI draft, refreshes the AI-generated fields while preserving the override flag and the reviewer's notes at write time, so a concurrent human decision is never collateral damage.
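The asymmetry between the two paths can be sketched in a few lines. This is a minimal in-memory illustration, not ResearchLog's implementation: the row shape, the field names other than the override flag, and both function names are assumptions. It shows only the semantics of each path; making the automatic path safe under concurrency is a separate concern that the database section addresses.

```typescript
// Hypothetical row shape; only the override flag is named in the essay.
interface ClassificationRow {
  activityId: string;
  qualifies: boolean;
  confidence: number;
  narrative: string;
  userOverride: boolean;
  reviewerNotes: string | null;
}

// The AI-generated fields only. A draft can never carry override state.
type Draft = Pick<ClassificationRow, "qualifies" | "confidence" | "narrative">;

// Automatic path: yields to humans. Returns null when the write is skipped.
function applyAutomaticDraft(
  existing: ClassificationRow | undefined,
  activityId: string,
  draft: Draft,
): ClassificationRow | null {
  if (existing?.userOverride) return null; // a human decided; leave the row alone
  return {
    activityId,
    qualifies: draft.qualifies,
    confidence: draft.confidence,
    narrative: draft.narrative,
    userOverride: false,
    reviewerNotes: existing?.reviewerNotes ?? null,
  };
}

// Manual path: a reviewer asked for a fresh draft. The AI fields refresh,
// while the override flag and notes are taken from the row as it stands
// at write time.
function applyManualRefresh(
  current: ClassificationRow,
  draft: Draft,
): ClassificationRow {
  return {
    ...current,
    qualifies: draft.qualifies,
    confidence: draft.confidence,
    narrative: draft.narrative,
  };
}
```

Typing the draft as a narrow subset of the row is the design choice worth noting: neither path can overwrite the flag or the notes by accident, because a draft has no field for them.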
The Boundary: Model Output Is Untrusted Input
The second decision was to treat the model like any external client. Nothing it produces is trusted until it passes a contract.
The contract lives in one place. A Zod schema defines what a classification response is: the qualification boolean, a confidence integer from 0 to 100, four-part scores, a bounded narrative, a capped list of disqualifiers, and a QRE category from a fixed enum. The JSON Schema sent to the model as its tool definition is derived from that same Zod schema, so the shape we request and the shape we are willing to persist cannot drift apart. Every response is parsed before persistence, and a malformed response fails loudly instead of writing garbage.
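The single-definition idea can be shown without any library. The sketch below is a dependency-free stand-in for the Zod schema, not the real contract: the field names, the bounds, and the enum values are illustrative assumptions, and it covers only a subset of the fields listed above. One spec object drives both the JSON Schema that would be sent as the tool definition and the parser that runs before persistence.

```typescript
// One field spec is the single source of truth for both derived artifacts.
type FieldSpec =
  | { kind: "boolean" }
  | { kind: "integer"; min: number; max: number }
  | { kind: "string"; maxLength: number }
  | { kind: "enum"; values: readonly string[] };

// Illustrative fields and limits (assumptions, not ResearchLog's values).
const classificationSpec: Record<string, FieldSpec> = {
  qualifies: { kind: "boolean" },
  confidence: { kind: "integer", min: 0, max: 100 },
  narrative: { kind: "string", maxLength: 2000 },
  qreCategory: { kind: "enum", values: ["wages", "supplies", "contract_research"] },
};

// Derive the JSON Schema for the model's tool definition.
function toJsonSchema(spec: Record<string, FieldSpec>): Record<string, unknown> {
  const properties: Record<string, unknown> = {};
  for (const [name, f] of Object.entries(spec)) {
    if (f.kind === "boolean") properties[name] = { type: "boolean" };
    else if (f.kind === "integer") properties[name] = { type: "integer", minimum: f.min, maximum: f.max };
    else if (f.kind === "string") properties[name] = { type: "string", maxLength: f.maxLength };
    else properties[name] = { type: "string", enum: [...f.values] };
  }
  return { type: "object", properties, required: Object.keys(spec), additionalProperties: false };
}

// Validate a response against the same spec. Throws instead of returning
// a partial result, so a malformed response can never be persisted.
function parseClassification(
  spec: Record<string, FieldSpec>,
  input: unknown,
): Record<string, unknown> {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    throw new Error("response is not an object");
  }
  const record = input as Record<string, unknown>;
  const out: Record<string, unknown> = {};
  for (const [name, f] of Object.entries(spec)) {
    const v = record[name];
    const ok =
      f.kind === "boolean" ? typeof v === "boolean"
      : f.kind === "integer" ? typeof v === "number" && Number.isInteger(v) && v >= f.min && v <= f.max
      : f.kind === "string" ? typeof v === "string" && v.length <= f.maxLength
      : typeof v === "string" && f.values.includes(v);
    if (!ok) throw new Error(`invalid field: ${name}`);
    out[name] = v; // only fields named in the spec are copied through
  }
  return out;
}
```

Changing a bound in the spec changes the requested shape and the accepted shape together, which is the property that keeps the two from drifting.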
The boundary runs in both directions. Prompt inputs are bounded too. Titles and descriptions are capped, and the full PR body and the nested provider payloads are deliberately excluded from the prompt. Part of that is cost control. The larger part is blast radius: a prompt that accepts arbitrary repository content is a prompt whose behavior you cannot reason about, and an output column without a length cap is a denial-of-service vector aimed at your own dashboard.
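Bounding the prompt side might look like the following sketch. The caps, the type names, and the function names are assumptions made for illustration; the essay does not state the real limits. The point is the allowlist: the prompt input is built field by field, so the PR body and the provider payload cannot reach the model even if they are present on the synced record.

```typescript
// Illustrative caps (assumptions; the real limits are not given).
const MAX_TITLE_CHARS = 200;
const MAX_DESCRIPTION_CHARS = 1000;

// Hypothetical shape of a synced activity.
interface SyncedActivity {
  title: string;
  description: string | null;
  body?: string; // full PR body: never forwarded to the model
  rawPayload?: unknown; // nested provider JSON: never forwarded to the model
}

interface PromptInput {
  title: string;
  description: string;
}

// slice counts UTF-16 code units, so a cap can split an emoji at the
// boundary; acceptable for a sketch, worth handling in production.
function clip(text: string, max: number): string {
  return text.length <= max ? text : text.slice(0, max);
}

// Build the prompt input from an allowlist. Anything not named here
// is excluded by construction, whatever the provider sends.
function toPromptInput(activity: SyncedActivity): PromptInput {
  return {
    title: clip(activity.title, MAX_TITLE_CHARS),
    description: clip(activity.description ?? "", MAX_DESCRIPTION_CHARS),
  };
}
```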
This is the same discipline as validating a webhook payload. The model does not get a pass on it because its output reads well.
Probabilistic
- GitHub PRs + issues
- Inngest jobs
- Claude draft
The contract
- bounded prompt inputs
- JSON Schema tool definition
- Zod parse before persist
Durable truth
- RPC override guard
- RLS-scoped writes
- user_override = true
The Invariant Lives in the Database
Schema validation catches malformed output. It cannot catch well-formed output arriving at the wrong moment. The race is plain: an automatic job reads a row, sees no override, and calls the model. In the seconds that takes, a manager overrides the classification. The job's write now clobbers a decision the job never saw.
Application-level checks cannot close that gap, because read, check, and write are separate steps across a network. So the rule moved into the database, where every write path has to pass through it. A single RPC performs the automatic-path write, and the guard travels inside the statement:
-- Condensed to the refresh case: the automatic path may insert or
-- refresh a row, and the override guard is part of the atomic statement.
update classifications
set qualifies = p_qualifies,
    confidence = p_confidence,
    narrative = p_narrative
where activity_id = p_activity_id
  and user_override = false;
If a human got there first, the statement matches zero rows and the draft is discarded. There is no window where the application's stale view of the row matters, and the invariant holds for every caller, including the background job someone adds in two years without reading the original design discussion.
That is the real argument for enforcing this in the database. Rules implemented in application code last until someone refactors around them. The database outlives the refactor.
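The interleaving is easy to replay in memory. The sketch below is a simulation under stated assumptions, not the RPC itself: the table is a Map, the function names are hypothetical, and a synchronous function stands in for the atomicity that a single SQL statement provides in the real system.

```typescript
interface Row {
  qualifies: boolean;
  confidence: number;
  narrative: string;
  userOverride: boolean;
}

// Stand-in for the classifications table, keyed by activity id.
const table = new Map<string, Row>();

// Mirrors the guarded UPDATE: the override check and the write happen in
// one step. Returns the number of rows matched (0 or 1).
function guardedAutomaticWrite(
  activityId: string,
  draft: Omit<Row, "userOverride">,
): number {
  const row = table.get(activityId);
  if (!row || row.userOverride) return 0; // guard failed: discard the draft
  table.set(activityId, { ...row, ...draft });
  return 1;
}

// The human decision: records the judgment and sets the flag.
function managerOverride(activityId: string, qualifies: boolean): void {
  const row = table.get(activityId);
  if (row) table.set(activityId, { ...row, qualifies, userOverride: true });
}

// Replay the race described in the text.
table.set("pr-1", { qualifies: true, confidence: 80, narrative: "first draft", userOverride: false });
const staleView = table.get("pr-1"); // the job reads the row: no override yet
managerOverride("pr-1", false); // the manager overrides while the model runs
const matched = guardedAutomaticWrite("pr-1", {
  qualifies: true,
  confidence: 91,
  narrative: "second draft",
});

// The job's stale view still says "no override", but the write matched nothing.
console.log(staleView?.userOverride, matched, table.get("pr-1")?.qualifies);
// prints: false 0 false
```

A caller can treat a matched count of zero as a normal outcome, not an error: it means a human got there first and the draft was correctly dropped.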
Security Layers and Product Layers Fail Differently
ResearchLog's authorization is layered conventionally. User-driven API requests run on a Supabase client scoped by the user's JWT, so row-level security applies even if an application-level filter regresses someday. The service-role client, which bypasses RLS, is reserved for jobs, webhooks, and system persistence, where no user context exists. Manager-only endpoints check membership roles on top of that.
Manual smoke testing was the first pass that ever put two users inside the same organization, and that is where this layering caught a real bug. The shape of the bug is the interesting part.
When I signed in as an engineer, the UI showed manager-only controls: override buttons and the component creation form. The API was not fooled. Every write came back 403, exactly as designed. But the interface was promising actions the system would refuse, and that is its own kind of correctness failure.
There was no security hole. Row-level security was doing what it was designed to do: members of an org may read their co-members' membership rows, because the product needs that. The frontend's membership query simply fetched memberships without scoping to the signed-in user. In a multi-member org it received the manager's row alongside the engineer's, resolved the current role with a first-match lookup, and picked the wrong one. The fix was one line, plus a regression test pinning the contract:
.from('org_memberships')
.select('org_id, role')
.eq('user_id', session.user.id) // scope to the signed-in user
The lesson I took from it: backend authorization and product-level role gating are related layers with different jobs. RLS answers what a user may read and write, and says nothing about what a user should be shown. A query can be perfectly secure and still feed the UI data that mis-gates the product. Both layers need their own tests, and the second one only gets exercised when test data includes the multi-user shapes production will actually have.
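A regression test for this class of bug needs a fixture with more than one member per org. The sketch below reproduces the failure shape in plain TypeScript; the types, the function names, and the ids are hypothetical, and the first function is a reconstruction of the first-match behavior described above, not the original code.

```typescript
type Role = "manager" | "engineer"; // the two roles named in the essay

interface Membership {
  userId: string;
  orgId: string;
  role: Role;
}

// Buggy shape: assumes every row returned belongs to the signed-in user.
function resolveRoleFirstMatch(rows: Membership[], orgId: string): Role | undefined {
  return rows.find((m) => m.orgId === orgId)?.role;
}

// Fixed shape: match on the signed-in user as well as the org.
function resolveRoleForUser(
  rows: Membership[],
  orgId: string,
  userId: string,
): Role | undefined {
  return rows.find((m) => m.orgId === orgId && m.userId === userId)?.role;
}

// What RLS legitimately returns in a multi-member org: co-members' rows too.
const visibleRows: Membership[] = [
  { userId: "u-manager", orgId: "org-1", role: "manager" },
  { userId: "u-engineer", orgId: "org-1", role: "engineer" },
];

const buggyRole = resolveRoleFirstMatch(visibleRows, "org-1");
const fixedRole = resolveRoleForUser(visibleRows, "org-1", "u-engineer");
console.log(buggyRole, fixedRole); // prints: manager engineer
```

With a single-member fixture both functions return the same role, which is why the bug stays invisible until the test data has two users in one org.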
Making Uncertainty Visible
The last boundary is the interface. If the model's output is a draft, the UI has to present it as one.
Every classification in the review feed carries its confidence, its four-part scores, its disqualifiers, and the business component the model inferred. These are the inputs a reviewer needs in order to agree or push back, and they sit next to the accept and override controls that record the decision. The review screen is internal tooling in the truest sense: its user is the person whose judgment the system exists to protect.
The dashboard partitions work into three mutually exclusive buckets: qualifying, review needed, and disqualified. Every activity lands in exactly one, so nothing is double-counted and low-confidence work cannot hide in two places at once. The review bucket is a queue someone is expected to empty. The activity list also deliberately excludes raw GitHub payloads, so the UI receives the fields the workflow needs instead of a megabyte of provider JSON.
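Mutual exclusivity is easiest to guarantee when bucketing is one total function. The sketch below is an illustration under assumptions: the threshold value, the field names, and the rule that a human override settles a row regardless of confidence are all hypothetical, since the essay does not spell out the dashboard's rules.

```typescript
type Bucket = "qualifying" | "review_needed" | "disqualified";

// Illustrative threshold (an assumption; the real cutoff is not stated).
const REVIEW_THRESHOLD = 70;

interface Classified {
  qualifies: boolean;
  confidence: number; // integer from 0 to 100
  userOverride: boolean;
}

// Every input takes exactly one return path, so every activity lands in
// exactly one bucket.
function bucketFor(c: Classified): Bucket {
  // Assumed rule: a low-confidence draft needs review unless a human
  // has already decided.
  if (!c.userOverride && c.confidence < REVIEW_THRESHOLD) return "review_needed";
  return c.qualifies ? "qualifying" : "disqualified";
}

// Dashboard totals are derived from the same function, so the three
// counts always sum to the number of activities.
function countBuckets(items: Classified[]): Record<Bucket, number> {
  const counts: Record<Bucket, number> = {
    qualifying: 0,
    review_needed: 0,
    disqualified: 0,
  };
  for (const item of items) counts[bucketFor(item)] += 1;
  return counts;
}
```

Deriving the counts from the per-item function, instead of running three separate filtered queries, is what rules out double counting: there is no second predicate that could overlap with the first.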
The point of all this is to lower the cost of inserting a human at the right moment. In a workflow like this, that is worth more than a smarter model.
What I Would Carry Forward
The architecture reduces to a sentence: models draft, humans decide, databases enforce, and the UI makes uncertainty visible. The longer version:
- Decide explicitly where model output stops being a suggestion. If that boundary is not written down, your upsert path has already decided it for you.
- Validate model output like untrusted client input, and derive the model's output schema and your runtime validation from one definition so they cannot drift.
- Put invariants that protect human judgment in the layer every write must pass through. For most products that is the database.
- Treat human decisions as the most expensive data in the system. Any code path that can touch them should have to prove it preserves them.
- Test security semantics and product semantics separately, with data shaped like production, including the multi-user shapes that only appear after launch.
AI-assisted workflows raise the bar for systems engineering, because the new component is fluent and confidently wrong at unpredictable moments. The interesting work is everything wrapped around the model call. That work is ordinary engineering, and it decides whether anyone can trust what ships.