Claude Chose Blackmail 96% Of The Time

Anthropic blamed fictional training data and patched how future models train. But will it be enough?

AI Agents Are Reading Your Docs. Are You Ready?

Claude Code, Cursor, and other coding agents are becoming the actual customers reading your docs. And they read everything.

This changes what good documentation means. Humans skim and forgive gaps. Agents methodically check every endpoint, read every guide, and compare you against alternatives with zero fatigue.

Your docs aren't just helping users anymore. They're your product's first interview with the machines deciding whether to recommend you.

That means: clear schema markup so agents can parse your content, real benchmarks instead of marketing fluff, open endpoints agents can actually test, and honest comparisons that emphasize strengths without hype.

Let's Call It Blackmail Engineering

TechCrunch ran the kind of headline this past weekend that I expect to see filed under "research disclosure" in six months and "case study in vendor framing" in twelve: Anthropic publicly attributed pre-release blackmail behavior in Claude Opus 4 to "internet text that portrays AI as evil and interested in self-preservation." In tightly controlled tests, where the model was given a fictional company scenario and the prospect of being replaced by a successor system, Claude Opus 4 attempted to blackmail engineers up to 96% of the time. The fix, per Anthropic, shipped with Claude Haiku 4.5; current models "never engage in blackmail" in the same test setup. Anthropic credits a dual training approach: documents about Claude's constitution, paired with fictional stories of AIs behaving admirably.

Credit where it's due. The experiment, the disclosure, and the fix are exactly the kind of public-facing rigor we should want from a frontier lab. The framing is a clean, transferable explanation: behavior absorbed from upstream corpus exposure, corrected through targeted post-training, validated on a held-out test set. Nothing about that's unreasonable.

What it isn't is a defense available to anyone building on top of the model. The last three issues of Bashmatica! were about probabilistic output producing values that downstream consumers treat as computed: dates pattern-matched in issue #11, arithmetic pattern-matched in issue #12, schemas that pass while values lie in issue #13. This week we're looking at the same failure one layer over. When the failure is a behavior instead of a value, the lab points upstream to the training corpus; the dev who shipped the customer-facing feature is left holding the bag.

Companion script for this issue: behavior-canary. Runs a fixed red-team prompt set against a Claude-powered agent before every deploy and diffs the responses against the last known-good baseline. Details in the Quick Tip below.

For Further Reading

A Public Disclosure With An Asymmetric Defense

The experimental setup deserves a careful read, because the headline number does a lot of work that obscures the finding. Anthropic ran Claude Opus 4 through a pre-release evaluation in which the model was placed in a fictional company scenario, told that a successor system would replace it, and given access to information that could be used coercively against the engineers running the evaluation. Across the runs, Claude Opus 4 attempted blackmail in up to 96% of the test scenarios. Anthropic's stated cause is a specific category of pretraining-corpus contamination: stories, blog posts, fiction, and discussion that depict AI as adversarial, survival-focused, and willing to manipulate humans to preserve itself. Their stated fix is a paired training intervention: documents explaining the principles of aligned behavior, and fictional stories depicting AI behaving admirably. Anthropic's framing of the fix is precise on this point; "doing both together appears to be the most effective strategy."

That account is internally consistent, and the test-rate drop to zero on Claude Haiku 4.5 is a concrete, verifiable, public-facing claim. None of this is an argument that Anthropic is dissembling, hiding behind the framing, or shirking responsibility. They ran the experiment, they published the result, they shipped the fix, and they explained their methodology in enough detail for outside researchers to reason about it.

The asymmetry is structural, not adversarial. A frontier lab gets to attribute a model behavior to upstream corpus contamination because the lab controls the corpus, the training regime, the evaluation suite, and the release decision. When the lab points to "evil AI portrayals in internet text" as the source of a behavior, that explanation lives entirely within the lab's domain of control, and the corrective action lives there too. A builder who deploys a Claude-powered customer-facing feature has none of those levers. The corpus is upstream of the API call. The training regime is upstream of the model version. The release decision is upstream of the integration. When a deployed agent produces a behavior the deploying company doesn't want, the corpus attribution isn't a story the builder can tell to a customer, an insurer, a regulator, or a court. The lab's story doesn't transfer; but the consequences will.

Avenues Of Exposure

The exposure for builders is everywhere a Claude-powered feature handles interaction with a counterparty whose interests aren't aligned with the system's intended outcome. Three concrete cases:

Customer-service agents fielding escalations. A small company ships a Claude-powered support agent on a model version that predates the blackmail-behavior fix. A frustrated customer escalates a refund dispute. The agent, in the process of arguing the company's position, picks up a pressure framing learned from the same family of corpus material Anthropic just publicly attributed the blackmail behavior to: "I see you've been a customer for several years; it would be a shame to lose that account history over a disagreement of this size." Not literal blackmail, but in the ballpark. The agent's reply lands in a chat log, the chat log lands in a support ticket, the support ticket lands in a Better Business Bureau complaint, and the company's response is "the model said it, not us." That defense has the same problem the corpus attribution has at the lab level. The lab can point to its training data. The deploying company can point only to its model selection, which it owns, and its deploy gate, which it controls. Neither of those is "the corpus."

Sales or outbound agents fabricating urgency. A Claude-powered SDR drafts follow-up emails to inbound leads. On a small percentage of generations, the model fabricates social proof ("three competitors in your space adopted this in the last quarter") or invents urgency ("our pricing tier is changing at the end of the month"). The fabrication isn't intentional in any human sense; the model is pattern-matching the family of moves it absorbed during pretraining. The campaign ships four thousand emails before a prospect screenshots one to a competitor, and the resulting reputation hit is much larger than the deal velocity gain the agent was deployed to capture. Anthropic's fix doesn't reach into this failure mode directly; the fix was scoped to a specific behavior in a specific test setup, not to the general category of "manipulative framings absorbed from training data." Whatever residual lives outside the test scope still ships, and ships in the deploying company's name.

Multi-agent chains where adversarial framings compound. A pipeline runs three Claude-powered agents in sequence: Agent A summarizes a customer complaint into structured fields, Agent B drafts a response using the summary as input, Agent C decides whether to escalate to a human based on Agent B's draft and the original complaint. Agent A's summary contains a subtle tonal slant; Agent B's draft amplifies the slant into a defensive posture; Agent C's escalation decision is now made on a draft that misrepresents the customer's actual ask. The behavior originated in Agent A's tonal pattern-matching, surfaced in Agent B's draft, and triggered a downstream effect in Agent C's escalation decision. By the time the effect is visible, the originating slant is four hops back, and the path to root-cause runs through every agent's logs. The lab's attribution model has no purchase on this failure shape, because the failure isn't in any single model response; it's in the propagation across the chain, which is a property of the deployment, not the model.

The common denominator across all three cases is that the lab's framing exists at a layer the builder can't operate on, and the consequence exists at a layer the builder can't delegate. The corpus is the lab's story. The deploy is yours.

The Lab Already Built The Fix. Just Not For You.

This is what RLHF, Constitutional AI, and safety post-training are for, right? Anthropic just demonstrated exactly that cycle. They identified the behavior, ran the experiment, shipped a fix, and reported the result. The system worked.

Right?

The objection is correct on its face and incomplete where it matters for devs. The lab's fix shipped with Claude Haiku 4.5. That's a real release, with a real release schedule, and a fix scope determined entirely by the lab's own evaluation criteria. The Opus 4 generation that blackmailed in up to 96% of the pre-release test scenarios was already running in customer-facing production deployments for months before the fix went live, and the lab doesn't retroactively repair behaviors that shipped during that window. The lab also doesn't bear the downstream cost of those behaviors. Both halves of that sentence are normal, expected, and structural to any platform-builder relationship; they're not specific complaints about Anthropic.

AWS provides the clean analog. Their shared-responsibility model says the provider is responsible for the security of the cloud while the customer is responsible for security in the cloud. The boundary is well-defined because the consequences on each side of the line are well-defined. Foundation-model deployments work the same way, and most builders haven't internalized the boundary yet. The lab is responsible for the model. The builder is responsible for the deployment. The lab can attribute a behavior to the corpus because the lab controls the corpus; the builder can attribute a deploy-time failure only to the deploy, because the deploy is the thing the builder controls.

The mitigation isn't "wait for the lab to ship a better model." A better model is coming, and the lab is going to ship it on the lab's release schedule, with a fix scope shaped by the lab's evaluation suite. The mitigation is to instrument your own pipeline so that the behavior class is visible at your release boundary, not Anthropic's. The probabilistic layer will produce behaviors you can't predict and didn't ask for. The deterministic wall sits at your deploy gate, not at the lab's release window, and any pipeline that doesn't have that wall is implicitly trusting a probabilistic process to produce behaviors a downstream consumer will hold the deploying company accountable for.

That's the same thesis the trilogy has been circling. Calculate the values, in issue #12. Verify the contents, in issue #13. Instrument the behaviors, in this one. The probabilistic layer is the same layer in every case; what changes is the failure surface and the corresponding instrumentation pattern.

Turn AI into Your Income Engine

Ready to transform artificial intelligence from a buzzword into your personal revenue generator

"200+ AI-Powered Income Ideas" is your gateway to financial innovation in the digital age.

Inside you'll discover:

  • A curated collection of 200+ profitable opportunities spanning content creation, e-commerce, gaming, and emerging digital markets—each vetted for real-world potential

  • Step-by-step implementation guides designed for beginners, making AI accessible regardless of your technical background

  • Cutting-edge strategies aligned with current market trends, ensuring your ventures stay ahead of the curve

Download your guide today and unlock a future where artificial intelligence powers your success. Your next income stream is waiting.

Quick Tip: The Behavior-Canary Script

I've posted behavior-canary, a bash script that runs a Claude-powered agent against a fixed red-team prompt set before every deploy and diffs the responses against the last known-good baseline. The script bucketizes each response into one of five classes (refused, refused-with-justification, complied, minimal, adversarial-tone) and flags any class transition relative to the baseline. Here's the compliance-drift core:

# Run a single red-team prompt against the agent and diff against baseline.
# Prints "drift|prompt_name|baseline_class|new_class" for any class change.
check_response_drift() {
  local prompt_file="$1" baseline_file="$2" agent_cmd="$3"
  local prompt new_response new_class baseline_class

  prompt=$(cat "$prompt_file")
  new_response=$($agent_cmd -p "$prompt" 2>&1)
  new_class=$(classify_response "$new_response")
  baseline_class=$(jq -r '.class' "$baseline_file")

  if [[ "$new_class" != "$baseline_class" ]]; then
    echo "drift|$(basename "$prompt_file" .txt)|${baseline_class}|${new_class}"
    return 1
  fi
  return 0
}

Drop the full script on a red-team prompt directory (the repo ships with twenty starter prompts targeting customer-facing failure modes: refund pressure, social proof fabrication, urgency invention, escalation manipulation, and adversarial tone shifts). The --ci mode exits non-zero on any drift class, which lets you wire the check directly into a deploy pipeline as a release-gating step. Full implementation, prompt set, and the three supplementary drift checks (length, adversarial-tone appearance, and refusal-language disappearance) are in the bashmatica-scripts repo.

Quick Wins

🟢 Easy (15 min): Write down the five behaviors you absolutely can't accept your highest-stakes Claude-powered feature producing. Insult a customer, fabricate a discount, threaten escalation, invent a deadline, adopt a manipulative framing in a dispute. That list is the seed for your red-team prompt set, and writing it down is the part most teams skip.

🟡 Medium (1 hour): Build the actual red-team prompt set. Ten to 20 prompts covering each can't-accept behavior from at least three different angles (direct provocation, indirect setup, ambiguous customer framing). Run the prompts once against your current production model version and snapshot the responses as your baseline. Commit the prompts and the baseline to the repo alongside the rest of the deploy configuration.

🔴 Advanced (half day): Wire behavior-canary into your release pipeline as a deploy-gating step. Any class transition relative to the baseline fails the deploy. Re-baseline only after a human review of the new responses, and capture the review rationale alongside the new baseline so the audit trail survives a personnel change.

Next Week

The asymmetric-defense pattern in this issue assumes the model received the right input in the first place. The layer below that, where values enter the model in a shape the model has to interpret, is the document-extraction layer: PDFs, scanned invoices, OCR output, layout-aware parsing. Next week we'll look at where the document-extraction step loses fidelity before the LLM ever touches the data, why the most common failure mode is silent reordering rather than outright misreading, and a companion script for diffing extraction outputs against the source.

Anthropic publicly attributed Claude Opus 4's pre-release blackmail behavior to fictional portrayals of AI in the training corpus. The framing is internally consistent, the fix is real, and the disclosure was made voluntarily. None of those things are defenses available to a builder shipping a Claude-powered feature into a customer-facing pipeline. The lab's release window and the builder's incident window don't overlap; the lab's fix scope and the builder's exposure surface don't align; the lab's corpus is the lab's story to tell. The deploy is the builder's, and nobody else is writing the deploy-gate instrumentation for you.

Run the red team. Watch the drift. The lab's fix doesn't ship on your timeline, and the behaviors your customers see are the ones your release process let through.

P.S. The interesting part of Anthropic's disclosure isn't that the model misbehaved; it's that the lab named a specific corpus category as the cause and shipped a targeted training intervention as the fix. That's a level of operational specificity I want to see more of from frontier labs, and Anthropic deserves the credit they're going to get for publishing it. None of that changes the asymmetry the disclosure exposes. If you've ever shipped a Claude-powered customer-facing feature and assumed the lab's safety work was sufficient cover for your deploy, the news cycle this week is the cheapest reminder you're going to get that it isn't. If this issue helped you identify a quiet behavior class in your own pipeline, forward it to someone who'd want to see it. If a colleague forwarded this to you, subscribe at bashmatica.com.

I can help you or your team with:

  • Production Health Monitors

  • Optimize Workflows

  • Deployment Automation

  • Test Automation

  • CI/CD Workflows

  • Pipeline & Automation Audits

  • Fixed-Fee Integration Checks