[Bashmatica] Off by Exactly $100

I asked a frontier model to total 25 invoice lines. It produced a believable, calculator-shaped wrong answer.

Let's Call It ~0-SUM Engineering

A few weeks ago, I read a blog post by Wes McKinney (the engineer behind pandas) that I couldn't quite shake. McKinney had been at an AI hackathon, building an LLM-powered repo activity summarizer, when he noticed the model couldn't reliably sum small lists of integers. He ran a structured test on Claude Sonnet 4.5 and GPT-4.1, and found that both models started failing somewhere around 20 to 25 single-digit numbers. Not failing dramatically, just quietly, with confident wrong answers in the right ballpark.

Last week I decided to verify that finding for myself, in the messiest realistic shape I could think of. I gave a fresh frontier model a 25-line invoice reconciliation, the kind of expense-report list a CFO's chief of staff would forward to an analyst on a Friday afternoon, and asked for one total. The model came back with $9,094.31. The actual sum was $9,194.33. That is off by $100.02, in a context where "off by almost exactly $100" looks like a transcription error and not a model failure, which is precisely the danger.

That's the failure mode this issue is about. Last week we covered LLMs and calendar arithmetic. The deeper question (the one Issue #11 closed by gesturing at) is what else is hiding in your pipeline that looks like computation but isn't.

Companion script for this issue: unit-check. Validates LLM-emitted unit conversions against deterministic constants and flags threshold-crossing errors. Details in the Quick Tip below.

The Calculator-Shaped Hole In The Wall

A frontier LLM in 2026 will solve a tournament-grade competition math problem with surprising fluency. DeepSeek-R1 scored 97.3% on MATH-500. Gemini 2.5 Pro hit 92% on AIME 2024. Claude 3.7 Sonnet posted 96% on MATH-500. Read those numbers cold and the natural inference is that arithmetic is a solved problem for the current generation of models.

The same generation of models cannot reliably sum 25 single-digit numbers. McKinney measured it directly. I reproduced it on a 25-line invoice. The 2025 ORCA benchmark, which tested five frontier models on 500 real-world calculation tasks across finance, physics, health, and engineering, recorded a top score of 63%, and 68% of the errors logged were mechanical (rounding or arithmetic) rather than reasoning errors. The benchmarks and the production failures are measuring different things, and conflating them is exactly how this failure mode hides in production.

The mechanism is described in a paper from BlackboxNLP at ACL 2025 ("The Lookahead Limitation," arXiv 2502.19981). Current LLMs use a one-digit lookahead heuristic for carry propagation. When the model adds 47 + 38, it can anticipate that the units column (7 + 8 = 15) will produce a carry into the tens column. That works. When the model adds 47 + 38 + 29, the units column produces a sum (7 + 8 + 9 = 24) whose carry digit (2) exceeds what one-digit lookahead can track, and the tens column begins to drift. Add more operands, and the cascading carries compound past the heuristic's reach.
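To make the carry arithmetic concrete, here is a small sketch that recomputes the units column for the two sums quoted above. The helper name units_carry is my own illustration, not something from the paper, and it only handles non-negative integers without leading zeros.

```shell
# Illustrative helper (hypothetical name): add up the units digits of the
# given non-negative integers and report the carry into the tens column.
units_carry() {
  local n
  local units_sum=0
  for n in "$@"; do
    units_sum=$(( units_sum + n % 10 ))
  done
  echo "units_sum=${units_sum} carry=$(( units_sum / 10 ))"
}

units_carry 47 38        # units_sum=15 carry=1
units_carry 47 38 29     # units_sum=24 carry=2
```

Two operands can never push the units carry past 1. A third operand is enough to produce a carry of 2, and every extra operand widens the range of carries the tens column has to absorb.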

That's why a 25-row invoice is harder for the model than a four-step calculus problem. The competition problem rewards structured multi-step reasoning, which the model has been trained to produce. The invoice rewards cascading single-digit arithmetic across many operands, which the model approximates because the underlying mechanism cannot do it. The output looks like a calculation. There's no calculation underneath.

The other half of the picture is what the model emits when the answer is wrong. My reproduction came back as $9,094.31. The number ends in cents. It has the right magnitude. It would survive any reviewer who didn't independently sum the column. Compare that to a hallucination of the form "the total is approximately twelve thousand dollars," which any reviewer would catch. The probabilistic answer that looks like a calculator's output is the worse failure mode, because it ships.
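The independent sum a reviewer would need is cheap to automate. Below is a minimal sketch, assuming one amount per line on stdin. The function names sum_amounts and check_total are hypothetical, and the sample figures are invented for illustration, not taken from my invoice.

```shell
# Sum dollar amounts (one per line on stdin, e.g. "1,204.50" or "$89.99")
# in integer cents so no floating-point rounding creeps in.
# Assumes non-negative amounts; digits past two decimals are truncated.
sum_amounts() {
  awk '
    {
      gsub(/[$, \t\r]/, "")                    # drop currency signs, commas, whitespace
      if ($0 == "") next
      split($0, parts, ".")
      cents = substr(parts[2] "00", 1, 2)      # pad so "5" -> "50", "" -> "00"
      total += parts[1] * 100 + cents
    }
    END { printf "%d.%02d\n", int(total / 100), total % 100 }
  '
}

# Compare a claimed total ($1) with the recomputed sum of the line items on stdin.
# Prints OK or MISMATCH; returns non-zero on mismatch.
check_total() {
  local claimed actual
  claimed=$(printf '%s\n' "$1" | sum_amounts)  # normalise "$1,300.00" -> "1300.00"
  actual=$(sum_amounts)
  if [ "$claimed" = "$actual" ]; then
    echo "OK ${actual}"
  else
    echo "MISMATCH claimed=${claimed} actual=${actual}"
    return 1
  fi
}

printf '%s\n' '1,200.00' '89.99' '10.01' | check_total '$1,300.00'   # OK 1300.00
```

Working in integer cents is a deliberate choice: the check compares exact values, so a two-cent drift is flagged just as loudly as a hundred-dollar one.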

WHERE Like Unlike Like

The exposure surface is anywhere your pipeline asks an LLM to produce a number that downstream systems will treat as computed. Three concrete cases.

Currency rollups, expense reconciliations, financial summaries. The McKinney case generalizes to any column-summation context: invoice line items, expense reports, monthly financial digests, ad-spend rollups, ARR by segment. FinanceBench, the financial-document QA benchmark published by Patronus AI in November 2023, recorded GPT-4-Turbo incorrectly answering or refusing 81% of 150 sampled questions. Current frontier models score better, but no replication has matched the breadth of FinanceBench since, so the lower bound on the failure mode is well established and the upper bound is unknown.

Unit conversions, especially near tier thresholds. I gave a frontier model a 454 kg pallet, the FedEx Freight tier definition (under 1000 lb vs. 1000+ lb, with a heavy-tier surcharge applied at the threshold), and asked which tier to file. The model used the full-precision conversion factor (1 kg = 2.20462 lb), arrived at 1000.9 lb, and correctly called the heavy tier. Good answer. Now consider what happens when the model picks the colloquial conversion factor (1 kg ≈ 2.2 lb) on the next call: 454 × 2.2 = 998.8 lb, which routes the same shipment under the threshold and into the wrong tier entirely. The bug isn't that the model gets the conversion wrong every time. The bug is that the precision it uses isn't pinned. Same input, different temperature draw, different tier classification. For a logistics or compliance pipeline, that variance is the production incident.

Numerical fidelity in pipeline rewrites. Last week I ran the same set of weekly metrics (CTR up, CVR up, ROAS up, CAC down, email open rate up) through two different framings: a "punchy plain-English" Slack digest and a "CMO board paragraph." The Slack digest stripped every specific number from the input ("more people clicked our ads, more of them actually checked out") and produced a directional summary with no quantitative content. The board paragraph quantified the conversion-rate lift correctly (29% relative on a 0.7-percentage-point absolute change), then chose not to quantify the CAC compression, the pipeline coverage expansion, or the share-of-voice change. Same data. Same model. Two prompts, two completely different numerical outputs flowing downstream, and no error logging will flag the divergence because nothing is technically wrong.

The common thread across all three cases is that nobody in the pipeline is verifying the numbers against an authoritative computation before they ship. The LLM produces values that look computed but aren't, and the downstream system has no way to tell, because the failure surface looks identical to the success surface in every observable way.
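For the pallet case, "authoritative computation" can be as small as the sketch below. The function name classify_tier and the label "standard" for the lighter tier are my own assumptions. The conversion factor is exposed as a parameter only to demonstrate the drift; in a real pipeline it would stay pinned.

```shell
# Pinned constant: 1 kg = 2.20462262 lb.
KG_TO_LB="2.20462262"

# Classify a weight in kg against a pound threshold (default 1000 lb).
# Usage: classify_tier KG [FACTOR] [THRESHOLD_LB]
classify_tier() {
  local kg="$1"
  local factor="${2:-$KG_TO_LB}"
  local threshold_lb="${3:-1000}"
  awk -v k="$kg" -v f="$factor" -v t="$threshold_lb" 'BEGIN {
    lb = k * f
    printf "%.1f lb %s\n", lb, (lb >= t ? "heavy" : "standard")
  }'
}

classify_tier 454          # 1000.9 lb heavy
classify_tier 454 2.2      # 998.8 lb standard
```

The second call reproduces the misrouting with nothing changed except the factor, which is the argument for keeping that constant in code and out of the prompt.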

Over The Wall

Anthropic's official cookbook ships a calculator_tool.ipynb notebook in the tool_use directory, framed as the recommended pattern when "Claude needs to perform arithmetic operations based on user input." OpenAI publishes Code Interpreter as a first-class tool with the explicit framing that ChatGPT "uses pandas to analyze your data" rather than computing on its own. A 2023 OpenReview paper recorded a 27.5-point jump in MATH accuracy (42.2% → 69.7%) from no change other than letting GPT-4 execute Python rather than generate the answer directly.

None of these vendors will tell you in plain language that their model's arithmetic is unreliable. What they build is scaffolding that routes arithmetic away from the generative layer. A reasonable reader can draw their own conclusion about why.

The pattern has a name in the practitioner literature, even if it hasn't reached the major engineering blogs yet. A November 2025 post on dev.to (Trevor Lasn, "Trust the Server, Not the LLM") coined the phrase Zero Mental Math Architecture: shift all computation to deterministic backend code, reduce the LLM to a citation copy machine, treat probabilistic output as a presentation layer over deterministic data. The cleanest single-sentence version of the principle: "LLMs should be glorified UI for deterministic backends, not freelance accountants. Let Python do math, let the model copy receipts."

That's the thesis of this issue. Calculate the values. Let the LLM handle the conversation around them. The probabilistic layer is where prose, framing, narrative, and copywriting live, and those are exactly the things the model is good at. The deterministic wall is the architectural boundary that keeps each layer doing what it's actually competent to do, and any pipeline that doesn't have one is implicitly trusting a probabilistic process to produce values its downstream consumers will treat as computed.

Quick Tip: The Unit-Check Script

I've posted unit-check, a bash script that scans text for unit-conversion claims, recomputes them against full-precision constants, and flags both inaccuracies and threshold-crossing errors. Here's the validation core (the full version handles seven unit pairs, configurable tolerances, tier-aware threshold checks, and a --ci mode that exits non-zero on any flag):

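# Validate a claimed kg-to-lb conversion against the deterministic constant.
# Prints "claimed|actual|input" on mismatch.
# Returns 0 when a mismatch is found (grep-style) and 1 when the claim is
# within tolerance. The tolerance is an absolute difference in pounds.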
check_kg_to_lb() {
  local kg="$1" claimed_lb="$2" tolerance="${3:-0.05}"
  local actual_lb diff

  actual_lb=$(awk -v k="$kg" 'BEGIN { printf "%.4f", k * 2.20462262 }')
  diff=$(awk -v a="$actual_lb" -v c="$claimed_lb" \
    'BEGIN { d = a - c; if (d < 0) d = -d; printf "%.4f", d }')

  if (( $(awk -v d="$diff" -v t="$tolerance" 'BEGIN { print (d > t) }') )); then
    echo "${claimed_lb}|${actual_lb}|${kg}kg"
    return 0
  fi
  return 1
}

Drop the full script on a directory of LLM-generated logistics docs, shipping manifests, or compliance summaries; it flags every claimed-vs-computed mismatch beyond the configured tolerance, with file, line number, and the deterministic recomputation. The tier-aware mode also flags any conversion whose imprecision crosses a defined threshold, which is the case where "close enough" stops being close enough. Full implementation in the bashmatica-scripts repo.
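The threshold-crossing idea fits in a few lines. This is a hypothetical sketch of the concept, not the code from the repo: crosses_threshold is a name I made up, and it covers only the kg-to-lb pair.

```shell
# Flag a claimed kg-to-lb conversion when the claim and the recomputed
# value land on opposite sides of a threshold.
# Returns 0 when a crossing is found, 1 otherwise.
crosses_threshold() {
  local kg="$1"
  local claimed_lb="$2"
  local threshold_lb="$3"
  awk -v k="$kg" -v c="$claimed_lb" -v t="$threshold_lb" 'BEGIN {
    actual = k * 2.20462262
    if ((actual >= t) != (c >= t)) {
      printf "CROSSING claimed=%.1f actual=%.1f threshold=%s\n", c, actual, t
      exit 0
    }
    exit 1
  }'
}

crosses_threshold 454 998.8 1000   # CROSSING claimed=998.8 actual=1000.9 threshold=1000
```

A claim of 1000.5 lb for the same pallet is also imprecise, but it stays on the correct side of the threshold, so this check leaves it to the plain tolerance check.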

Quick Wins

🟢 Easy (15 min): Pick a recent LLM-generated artifact from your pipeline (a financial digest, a logistics summary, a metrics narration) and pull every number out by hand. Recompute three of them deterministically (sum, conversion, percentage change). The variance you find is the variance your downstream consumers have been silently absorbing.

🟡 Medium (1 hour): Drop unit-check on the output directory of any scheduled LLM pipeline that touches conversions. Set a tolerance appropriate to the domain (0.5% for general logistics, 0.05% for compliance, tighter for pharmaceutical or regulated contexts). Run it on the last 30 days of output. Anything it flags is a quiet failure your pipeline was already shipping.

🔴 Advanced (half day): Audit your full pipeline for every point at which an LLM emits a numerical value downstream consumers will treat as computed. For each one, replace the LLM's calculation with a deterministic equivalent (Python, SQL, whatever your stack uses) and pass the resulting value to the model as context, asking it only to narrate around the number rather than to produce it. Keep the LLM in the prose, keep the calculator on the math.
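The advanced exercise might look like the sketch below. The names percent_change and build_prompt are hypothetical, the sample figures are invented, and no model API is called: the script only assembles the text you would send.

```shell
# Compute a relative change deterministically. Fails on a zero baseline.
percent_change() {
  local old="$1"
  local new="$2"
  awk -v o="$old" -v n="$new" 'BEGIN {
    if (o == 0) exit 1                 # undefined against a zero baseline
    printf "%.1f", (n - o) / o * 100
  }'
}

# Build a prompt that hands the model finished numbers and asks only for prose.
build_prompt() {
  local label="$1"
  local old="$2"
  local new="$3"
  local change
  change=$(percent_change "$old" "$new") || return 1
  printf 'Write one sentence about %s.\n' "$label"
  printf 'Use these figures verbatim and do not compute any others:\n'
  printf '  previous=%s current=%s change=%s%%\n' "$old" "$new" "$change"
}

build_prompt "weekly conversion rate" 2.4 3.1
```

With those inputs the last line of the prompt reads previous=2.4 current=3.1 change=29.2%. The model still has to copy the figures faithfully, so a post-check that every number in its reply appears in the prompt is a sensible companion.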

Next Week

If arithmetic is the cleanest case of "deterministic-looking output produced by a probabilistic process," the next-cleanest is structured data extraction: schemas, JSON outputs, field validation, the entire surface where an LLM produces values that look like a contract and aren't. Next week we'll look at where LLM-emitted structure breaks against downstream consumers, why every framework that ships structured-output validation does so for a reason, and a companion script that audits pipelines for schema drift the model can't see.

The $100.02 error in my reproduction is just over one percent of the $9,194.33 total, well inside the noise band any single accountant would shrug at on a one-off invoice. Multiply that error rate across thousands of LLM-generated financial documents per quarter, route them to systems that treat the numbers as computed, and the rounding error stops being noise and becomes a recurring liability with no error log to point at. The model didn't lie. The model also didn't compute. The pipeline that consumed the output had no way to tell, and that's the design failure worth fixing.

Calculate the values. Let the LLM handle the conversation. The deterministic layer and the probabilistic layer have different jobs, and the wall between them is what separates a pipeline that drifts from a pipeline that holds.

P.S. The reason this failure mode is hard to spot is that the wrong answers look like right answers. Hallucinations get caught because they read as obviously fabricated. Probabilistic arithmetic gets shipped because it reads as obviously calculated. If you've ever stared at a number in an LLM-generated report and thought "that looks right," and moved on, you've already met this failure mode. The fix is to stop looking. Compute the value, and let the model write the rest. If this issue helped you find a quiet one in your own pipeline, forward it to someone who'd want to see it.
