The Schema Passed. The Values Lied.
LLMs don't validate structured data. They validate structured data types. That distinction is critical to the resilience of your pipeline.
Let's Call It Sum-Of-Its-Parts Engineering
I closed last week's Bashmatica! with a simple thesis: probabilistic processes produce values that downstream consumers treat as computed, because they assume the values had to be. Arithmetic is the clearest case of this, but hardly the only one. Today's example, structured data extraction, is dangerous enough to unravel the most robust enterprise-grade pipelines if the distinction between probabilism and determinism isn't handled with the radioactive-handling protocols it needs.
A pipeline that asks an LLM for JSON gets JSON back. A validator runs, and it passes; then an orchestrator forwards the payload to the next stage. Every signal in the line says the contract held. The contract didn't hold; the validator just couldn't see what broke. JSON Schema validates that total_amount is a number. JSON Schema cannot validate that the number is correct, and in that way, the failure can cascade and threaten the integrity of every other link in the pipeline, because what's essentially dummy data survives automatic review.
Just like the 25-sum invoice from last week, here the output reads as authoritative because the wrapper is correct. The wrapper has nothing to do with the truth of its contents.
Companion script for this issue: schema-shadow. Audits LLM-emitted JSON for the failure modes JSON Schema cannot catch: silently dropped fields, type-coerced values, array-length drift (truncated or hallucinated entries), and field-order drift across calls. Details in the Quick Tip below.
For Further Reading
Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models (Tam et al., 2024). The paper that quantified the format tax. Strict JSON-mode degrades reasoning, and the stricter the format, the larger the drop.
Structured outputs can hurt the performance of LLMs (Dylan Castillo, 2024). Independent reproduction: 92.68% to 65.85% on the BIG-bench Shuffled Objects task when GPT-4o-mini was forced into JSON-Schema mode.
JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models (2025). 10,000 real-world schemas; the load-bearing finding is that structural metrics (Path Recall, Type Safety) stay above 96% while Value Accuracy drops to 0.69 to 0.83.
Bashmatica! #012: Off by exactly $100. The arithmetic case from last week. Same failure shape, different layer.
Bashmatica! #008: When the Dashboard Says 94%. Coverage metrics that pass while the meaningful gap stays open. Schema validation is the same idea applied to data contracts instead of test suites.
It’s A Wrapper, Not A Contract
Structured output exists to remove ambiguity. You define a schema, constrain the model, and the response either parses or it doesn't; downstream consumers get a clean object instead of a paragraph of prose to regex through. That chain of claims holds in the narrow sense that the JSON parses. Everything past parsing has to be re-verified, and almost nothing in the typical pipeline does that work, which is how the failure metastasizes from a single bad call into a full audit trail no one can faithfully reconstruct.
The first sign is in the format-restriction literature. Tam et al.'s 2024 study ("Let Me Speak Freely?", arXiv 2408.02442) separated two effects most people conflate: the prompt-level cost of asking for structured output, and the decoding-level cost of constraining the decoder. The two compound. Strict JSON-mode degrades reasoning across most of the open-weight model and task combinations Tam tested, and the degradation scales with format strictness. Dylan Castillo reproduced the finding on a public BIG-bench task: GPT-4o-mini scored 92.68% on Shuffled Objects in 3-shot freeform and 65.85% in 3-shot JSON-Schema mode. Same model, same prompt, same task; a 27-point drop introduced by the format alone.
The mechanism is straightforward. When the model has to commit to a schema's field order and type contract before it has finished reasoning, it loses its own output buffer as a working scratchpad. Strict JSON-mode is "answer first, reason never": the schema places the answer field early, and the model fills it before it produces any reasoning tokens at all. Tam's mitigation is the obvious one. Let the model reason in natural language first, then reformat into the schema in a second pass. That recovers most of the lost accuracy at the cost of doubling the inference bill, which is exactly why most pipelines skip it.
The second sign is more damning, because the wrong metric is the one being measured in the first place. JSONSchemaBench (2025) ran 10,000 real-world JSON schemas across the major constrained-decoding frameworks: 16 of 21 valid text models scored 96% or higher on Path Recall, Structure Coverage, and Type Safety. Same models, same tasks; Value Accuracy dropped to 0.69-0.83 (69-83%), and Perfect Response Rate (every field correct simultaneously) collapsed to 0.38-0.53 (38-53%). The shape passes; the contents are a coin flip. A pipeline that validates only the schema sees only the 96% number, and the roughly 50% Perfect Response Rate ships silently underneath it.
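To see the gap between shape and contents on your own data, here is a minimal sketch that scores one emitted payload against a hand-verified reference. It is a simplified illustration, not the benchmark's metric definitions: score_payload and its output format are hypothetical, and it assumes jq is installed.

```shell
# Hypothetical helper: compare an emitted payload with a hand-verified reference.
# Prints how many leaf fields have the right JSON type vs. the right value.
score_payload() {
  local reference="$1" emitted="$2"
  jq -rn --slurpfile ref "$reference" --slurpfile out "$emitted" '
    # Look up a path, tolerating missing or mis-shaped parents.
    def at($p): try getpath($p) catch null;
    $ref[0] as $r | $out[0] as $o
    # Every leaf (non-container) path in the reference payload.
    | [$r | paths(type != "object" and type != "array")] as $leaves
    | ($leaves | length) as $n
    | ([$leaves[] | select(. as $p | ($o | at($p) | type) == ($r | getpath($p) | type))] | length) as $typed
    | ([$leaves[] | select(. as $p | ($o | at($p)) == ($r | getpath($p)))] | length) as $equal
    | "type_ok=\($typed)/\($n) value_ok=\($equal)/\($n)"
  '
}
```

Run it as score_payload reference.json emitted.json. A payload can report every field type-correct and still miss on values, which is the number a schema validator never shows you.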
The third sign is the one practitioners notice without naming. Prompt-only JSON extraction (no constrained decoding, just "respond in this JSON shape") fails to parse in 5-20% of calls, depending on schema complexity. That's the visible failure rate. The invisible rate, where the JSON parses cleanly but the values are wrong, is bounded below by the JSONSchemaBench numbers and is almost certainly larger. Almost no pipeline measures it; almost every pipeline ships it.
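The visible rate is cheap to measure. A minimal sketch, assuming each model response was saved as its own .json file in one directory (the function name and layout are assumptions, not part of any existing tool):

```shell
# Count saved responses that fail to parse as JSON at all.
# Prints "parse_failures=<failed>/<total>".
parse_failure_rate() {
  local dir="$1" total=0 failed=0 f
  for f in "$dir"/*.json; do
    [ -e "$f" ] || continue            # directory had no .json files
    total=$((total + 1))
    # `jq -e type` exits non-zero on a parse error and on an empty file.
    jq -e 'type' "$f" >/dev/null 2>&1 || failed=$((failed + 1))
  done
  echo "parse_failures=${failed}/${total}"
}
```

This only measures the failures that already announce themselves. The invisible rate needs a value check against something authoritative.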
Contentinal Drift
The exposure is everywhere a downstream system trusts a field value because the JSON validated. Three concrete cases.
Field extraction from contracts, invoices, or compliance documents. The model returns a JSON object containing vendor_name, invoice_number, total_amount, due_date, and line_items[]. The schema validates. The vendor name and invoice number are correct. The total is the sum the model computed, which (per Issue #12) is the arithmetic failure itself. The due date pattern-matches "Net 30" against an invoice date the layout pipeline mis-cropped (per Issue #11). The line items array is correct in length 19 calls out of 20, and silently truncated to the first 11 entries on the 20th call, when the model decides 11 is enough to establish the pattern. Every one of those failures parses, none trip a schema validator, and every one ships into a downstream system that treats it as authoritative.
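The truncated line_items case is checkable if anything deterministic knows how many rows the document had. A minimal sketch, assuming the expected count comes from somewhere outside the model (a parser row count, for example); check_line_item_count and its output format are hypothetical:

```shell
# Compare the emitted line_items length with an expected count from a
# deterministic source. Optional third argument is an allowed tolerance.
# Returns 1 and prints a report line when the lengths diverge.
check_line_item_count() {
  local payload="$1" expected="$2" tolerance="${3:-0}" actual diff
  actual=$(jq -r 'if (.line_items | type) == "array"
                  then (.line_items | length) else 0 end' "$payload") || return 2
  diff=$((actual - expected))
  [ "$diff" -lt 0 ] && diff=$((0 - diff))
  if [ "$diff" -gt "$tolerance" ]; then
    echo "array_drift|line_items|expected=${expected}|actual=${actual}|${payload}"
    return 1
  fi
  return 0
}
```

On the 20th call described above, this would report expected 19 and actual 11 instead of letting the short array through.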
Tool-call arguments routed straight to APIs. Function-calling pipelines are the cleanest version of this failure, because schema validation is built into the framework. The model emits a tool call, the framework validates the arguments against the function signature, and the framework executes. When the model outputs {"tool": "search_transactions", "args": {"min_amount": 1000, "date_range": "last_quarter", "currency": "USD"}}, every argument is the right type and the schema passes. If the user asked about the previous fiscal quarter and the model bound last_quarter to the calendar quarter, the search returns the wrong rows; nothing in the pipeline knows, the report runs on the wrong window of time, and the variance discussion the next morning is about a number nobody can trace back to its source. Anthropic's own tool-use cookbook recommends a verification step for exactly this reason; almost no one implements it past the demo notebook.
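One way to take the ambiguity out of last_quarter is to resolve it deterministically before the call and pass explicit dates. A minimal sketch, assuming the caller knows the month the fiscal year starts; previous_quarter_window is a hypothetical helper, and the window it prints is half-open (start inclusive, end exclusive):

```shell
# Resolve "previous quarter" relative to a given date (YYYY-MM-DD).
# Second argument: first month of the fiscal year (1 = calendar year).
# Prints "<start> <end_exclusive>".
previous_quarter_window() {
  local today="$1" fiscal_start_month="${2:-1}"
  local y m k cur prev
  y=${today%%-*}                        # YYYY
  m=${today#*-}; m=${m%%-*}             # MM
  m=${m#0}                              # drop a leading zero so 08 is not read as octal
  k=$(( (m - fiscal_start_month + 12) % 12 ))   # months since the fiscal year began
  cur=$(( y * 12 + (m - 1) - (k % 3) ))         # first month of the current quarter
  prev=$(( cur - 3 ))                           # first month of the previous quarter
  printf '%04d-%02d-01 %04d-%02d-01\n' \
    $((prev / 12)) $((prev % 12 + 1)) $((cur / 12)) $((cur % 12 + 1))
}
```

For 2024-05-15 this prints 2024-01-01 2024-04-01 on a calendar year and 2024-02-01 2024-05-01 when the fiscal year starts in February. Logging the resolved dates next to the tool call gives the variance discussion something to trace back to.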
Agent outputs feeding the next agent in a chain. Multi-agent pipelines are this failure at compounding scale. Agent A emits a JSON state object; Agent B consumes it. Agent A's schema is correct and 8 of 10 values are correct, and Agent B has no way to verify the 2 wrong fields because Agent A was the source of record. The errors compound in the dark, and by the time the final agent triggers an observable downstream effect (a transaction, a notification, a state change), the original wrong value is buried four schema-validations deep, and the path to root-cause runs through every agent's logs.
The common denominator across all three cases: schema validation produces a confidence signal that has no relationship to the truth of the data inside the schema. The pipeline trusts the wrapper. The wrapper has no opinion on the contents.
Many will argue (somewhat correctly) that this is what observability is for. Log every LLM call, log the parsed JSON, ship payloads to a data warehouse, sample them, build dashboards, alert on anomalies. The observability stack exists to catch this.
That objection isn't wrong, but it's also not what production pipelines actually run. The typical observability footprint for an LLM call captures input prompt, model name, token counts, latency, a parse-success boolean, and the raw response. None of those signals catch a total_amount off by $100, a silently truncated line_items, or a model that bound last_quarter to the wrong fiscal interval. That information only exists at the field level, after a value-correctness check runs; that check is the thing the pipeline doesn't have, and the dev who has to defend the variance next quarter doesn't know it's missing until it's too late.
The fix is the same as last week's deterministic-wall fix, just at a different layer. Schema validation tells you whether the wrapper is well-formed. Value validation, run as a separate step against an authoritative source, tells you whether the contents are correct. The two checks are independent, and pipelines that conflate them are betting that the model gets the values right at the rate the schema validator suggests; JSONSchemaBench just told us that bet pays off about half the time.
The mitigations cluster into three patterns, in increasing order of cost and reliability.
The first is the reason-then-format pattern from Tam et al. Run the model once in natural language to produce the answer, then run a cheaper pass to reformat the answer into the schema. That recovers most of the accuracy loss from format restrictions, at the cost of doubling the inference call count per logical operation. It does nothing for value-correctness errors already present in the first response.
The second is the deterministic-recompute pattern. For any field that can be computed from other fields in the same payload (totals from line items, derived dates from base dates, classifications from threshold-crossing checks), recompute it deterministically post-emit and override the model's value. This is the unit-check pattern from last week generalized to schema fields. It catches arithmetic, threshold, and cascading-mistake errors, but does nothing for fields that originate in the source document.
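For the totals-from-line-items case, a recompute can be a few lines of jq. This is a sketch under stated assumptions: amounts are integers in minor units (cents), the per-item field is named amount, and the two audit fields it adds (total_overridden, total_amount_model) are hypothetical names.

```shell
# Recompute total_amount from line_items and override the model's value.
# Keeps the model's original number and a flag so the override is auditable.
recompute_total() {
  local payload="$1"
  jq '
    ([.line_items[].amount] | add // 0) as $sum
    | .total_overridden = (.total_amount != $sum)
    | .total_amount_model = .total_amount
    | .total_amount = $sum
  ' "$payload"
}
```

Integer minor units are a deliberate choice here: they keep the equality check exact instead of comparing floating-point sums.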
The third is the authoritative-source-cross-check pattern. For any field that originates outside the payload (vendor names from a known-vendor list, invoice numbers from an upstream system, dates from a parsed-document layer), cross-check against the authoritative source as a separate pipeline step. This is the most expensive pattern, and the only one that catches all three failure modes. Most pipelines that need this level of correctness end up here eventually; most that should need it never get there, because the schema validator keeps reporting all-green and nobody flags a system that says it's healthy.
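The simplest version of a cross-check is a known-set lookup. A minimal sketch, assuming the authoritative vendor list is a plain text file with one exact name per line; the file format and the check_vendor name are assumptions for illustration:

```shell
# Cross-check vendor_name against a known-vendor list (one exact name per line).
# Returns 1 and prints a report line when the vendor is missing or unknown.
check_vendor() {
  local payload="$1" vendor_list="$2" vendor
  vendor=$(jq -r '.vendor_name // empty' "$payload") || return 2
  if [ -z "$vendor" ]; then
    echo "missing|vendor_name|${payload}"
    return 1
  fi
  # -F fixed string, -x whole-line match, -q quiet.
  if ! grep -Fxq -- "$vendor" "$vendor_list"; then
    echo "unknown_vendor|${vendor}|${payload}"
    return 1
  fi
  return 0
}
```

An exact match is strict on purpose. A near-miss such as a slightly different company suffix is the kind of value a schema validator passes and a human should look at.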
Validate the wrapper. Verify the contents. Treat schema validation as a parse check, never as a correctness check.
Quick Tip: Schema-Shadow
I've created schema-shadow, a bash script that audits a directory of LLM-emitted JSON payloads for the failure types a schema validator can't catch. Four checks per payload: silently-dropped fields (the model emits the schema but omits one or more populated fields), type-coerced values (string-shaped numbers, ISO-date strings parsed as plain text), array drift (length divergence beyond a configurable tolerance against a baseline), and field-order drift (the model emits the schema's keys in a different order, a soft signal that internal generation order shifted). Here's the silently-dropped-field core:
    # Compare a payload's populated fields against a baseline payload of the same schema.
    # Prints "missing|field_path|payload_path" for any field the baseline populated
    # and this payload omitted or null-emitted.
    # Returns 0 when at least one field was dropped and 1 when nothing is missing
    # (grep-style: finding something counts as success).
    check_dropped_fields() {
      local baseline="$1" payload="$2"
      local baseline_keys payload_keys missing
      # Leaf paths holding a non-null value. `scalars | . != null` keeps boolean
      # false, which `scalars and . != null` would treat as unpopulated.
      baseline_keys=$(jq -r 'paths(scalars | . != null) | map(tostring) | join(".")' "$baseline" | sort -u)
      payload_keys=$(jq -r 'paths(scalars | . != null) | map(tostring) | join(".")' "$payload" | sort -u)
      # Lines present only in the baseline are the dropped fields.
      missing=$(comm -23 <(printf '%s\n' "$baseline_keys") <(printf '%s\n' "$payload_keys"))
      if [[ -n "$missing" ]]; then
        while IFS= read -r key; do
          echo "missing|${key}|${payload}"
        done <<< "$missing"
        return 0
      fi
      return 1
    }

Drop the full script on a directory of LLM-extracted contract data, agent-handoff state objects, or tool-call argument logs; it reports every drift class above with payload path, field path, and the specific divergence. The --baseline-dir mode lets you point at a known-correct sample set and audit a production batch against it. Full implementation in the bashmatica-scripts repo.
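For a feel of what the type-coercion and field-order checks involve, here is a sketch in the same spirit. It is not the repo's implementation: the function names and report formats below are hypothetical, and both checks are soft signals rather than proof of an error (an invoice number can legitimately be a digits-only string).

```shell
# Flag string values that look like numbers, e.g. "1250.00" where a number
# was probably expected. Prints "coerced|field_path|payload_path" per hit.
check_string_numbers() {
  local payload="$1"
  jq -r --arg p "$payload" '
    paths(strings | test("^-?[0-9]+(\\.[0-9]+)?$")) as $path
    | "coerced|\($path | map(tostring) | join("."))|\($p)"
  ' "$payload"
}

# Report when the top-level key order differs from the baseline's.
# A missing or extra key also shows up here, since the key lists differ.
check_key_order() {
  local baseline="$1" payload="$2" a b
  a=$(jq -c 'keys_unsorted' "$baseline") || return 2
  b=$(jq -c 'keys_unsorted' "$payload") || return 2
  if [ "$a" != "$b" ]; then
    echo "order_drift|${a}|${b}|${payload}"
    return 1
  fi
  return 0
}
```

Both rely on jq preserving the key order it read from the file, which is why keys_unsorted is used rather than keys.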
Quick Wins
🟢 Easy (15 min): Pull the last 50 LLM-emitted JSON payloads from a production pipeline and grep them for the schema's required fields. Count how many payloads populated every required field versus how many emitted a null, an empty string, or a placeholder. The gap between "schema validates" and "every field is populated" is the gap your validator was already tolerating.
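A minimal sketch of that count, assuming one payload per .json file and top-level required fields. The placeholder values it treats as unpopulated ("N/A", "TBD") are assumptions to adjust for your own data, and count_fully_populated is a hypothetical name:

```shell
# Count payloads where every required top-level field is present and populated.
# Usage: count_fully_populated <dir> <field> [<field>...]
count_fully_populated() {
  local dir="$1"; shift
  local required total=0 complete=0 f
  required=$(printf '%s\n' "$@" | jq -R . | jq -sc .)   # field names -> JSON array
  for f in "$dir"/*.json; do
    [ -e "$f" ] || continue
    total=$((total + 1))
    if jq -e --argjson req "$required" '
         . as $doc
         | all($req[]; . as $k | $doc[$k]
               | . != null and . != "" and . != "N/A" and . != "TBD")
       ' "$f" >/dev/null 2>&1; then
      complete=$((complete + 1))
    fi
  done
  echo "fully_populated=${complete}/${total}"
}
```

For example: count_fully_populated ./payloads vendor_name invoice_number total_amount due_date.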
🟡 Medium (1 hour): Run schema-shadow against the last 30 days of payloads from any LLM extraction pipeline you own. Use a clean known-correct payload as the baseline. Anything it flags is a quiet failure your schema validator was treating as a success.
🔴 Advanced (half day): Identify the three highest-stakes fields in your pipeline's largest LLM-emitted JSON contract (highest dollar exposure, highest compliance exposure, or highest downstream-blast-radius). For each one, build the authoritative-source cross-check: a deterministic recompute, a known-set lookup, or an upstream-system query. Run the cross-check inline and divert any divergent payload to a quarantine queue for review. The quarantine rate is your visibility into the failure mode.
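The quarantine step itself can stay small. A sketch under stated assumptions: payloads sit one per .json file, the check is any command that exits 0 for a clean payload, and the payload path is appended as the command's final argument. quarantine_batch and reasons.log are hypothetical names.

```shell
# Run a check against each payload; move failures to a quarantine directory.
# Usage: quarantine_batch <in_dir> <quarantine_dir> <check_cmd> [check args...]
# Prints "quarantined=<diverted>/<total>".
quarantine_batch() {
  local in_dir="$1" q_dir="$2"; shift 2
  local total=0 diverted=0 f
  mkdir -p "$q_dir"
  for f in "$in_dir"/*.json; do
    [ -e "$f" ] || continue
    total=$((total + 1))
    # Whatever the check prints is kept as the reason for the diversion.
    if ! "$@" "$f" >>"$q_dir/reasons.log"; then
      mv "$f" "$q_dir/"
      diverted=$((diverted + 1))
    fi
  done
  echo "quarantined=${diverted}/${total}"
}
```

Tracking the printed ratio over time is what turns the quarantine queue into a measurement rather than a dumping ground.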
Next Week
If structured-output validation is the wrapper that lies, the next layer down is the data the wrapper is supposed to describe: the document-layout pipeline that extracts the source values in the first place. PDFs, scanned invoices, OCR output, and layout-aware parsing are where values enter the LLM in a shape the LLM has to interpret. Next week: the failure modes of the document-extraction layer, why the most common failure is silent reordering rather than outright misreading, and a companion script for diffing extraction outputs against the source.
The 27-point drop on Shuffled Objects is the cleanest single number in the structured-output literature, but it isn't the critical finding for production pipelines. The critical finding is that schema validation reports green on payloads whose values are wrong about half the time, and the half that's wrong looks identical to the half that's right. That's the same failure shape Issue #12 described for arithmetic and Issue #11 described for dates: probabilistic output that produces a value in the right ballpark, in the right shape, with no pipeline-visible signal that anything has drifted. The wrapper holds. The contents lie. Nothing downstream has a way to tell, and by the time something does, the wrong value has propagated through every system that touched it.
Validate the schema. Verify the values. The schema is a parse check; the values need a separate authoritative source; the gap between the two is where every silent failure of the last three issues lives.
P.S. This issue resists a single tidy example. Arithmetic gives you "$100.02 off." Dates give you "Monday vs. Thursday." Schema validation gives you a JSON object that parses and a value inside it that's wrong, and there's no clean punchline because the punchline is precisely that there is no signal. If you've ever shipped an LLM-extracted field downstream because the JSON validated and called the work done, you've already met this failure mode. If this issue helped you find a quiet one in your own pipeline, forward it to someone who'd want to see it. If a colleague forwarded this to you, subscribe at bashmatica.com.