Threat Research

Prompt injection is a statistical problem, not a security patch.

Dipendra Jain·Apr 8, 2026·12 min read

The industry has been treating prompt injection as a vulnerability. Vulnerabilities get patched. A patch fixes a deterministic flaw in deterministic code. Apply the patch, the flaw is gone. Verify against a regression test. Close the ticket.

This framing has been wrong from the first paper. Prompt injection is not a vulnerability in that sense. It is a statistical property of the system. You cannot patch it. You can constrain its probability. You can shrink the attack surface. You can make it harder to exploit reliably. You cannot eliminate it.

Anyone selling you a patch is selling you a story.

The right defense is not deterministic code that catches a deterministic attack. It is statistical detection feeding deterministic enforcement on a different surface than the one being attacked.

That sentence is the entire architectural thesis. The rest of this post unpacks it.

What makes a vulnerability deterministic

Take a classic SQL injection. The flaw is in the way input is concatenated into a query string. Input X produces query Y. If Y contains SQL the attacker wrote, the attacker controls the database. The flaw is reproducible. The patch is a parameterized query that binds the input as data. After the patch, input X never produces query Y again. The vulnerability is closed.
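A minimal sketch of that before-and-after, using Python's standard sqlite3 module. The table, the function names, and the payload are illustrative assumptions, not taken from any real incident.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # Hypothetical in-memory table used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_vulnerable(conn: sqlite3.Connection, name: str) -> list:
    # Flaw: the input is concatenated into the query string,
    # so the database parses attacker text as SQL.
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def lookup_patched(conn: sqlite3.Connection, name: str) -> list:
    # Patch: the input is bound as a parameter and treated only as data.
    cursor = conn.execute("SELECT secret FROM users WHERE name = ?", (name,))
    return [row[0] for row in cursor]


# Input X: turns the WHERE clause into a condition that is always true.
MALICIOUS = "nobody' OR '1'='1"
```

Against the vulnerable function, the malicious input returns every secret in the table. Against the patched function, the same input matches no user and returns nothing, every time.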

Buffer overflows are the same. Input of length N+1 into a buffer of length N produces a deterministic out-of-bounds memory write. The patch is a bounds check. After the patch, the write does not happen.

Cross-site scripting is the same. The patch is output encoding. After the patch, the script tag does not execute.

In every case, the input-output relationship is deterministic. The vulnerability exists because the code did not check a property it should have checked. The patch adds the check. The patch is verifiable. You can write a test that asserts the patched code produces the safe output for the malicious input.
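The verifiability claim can be made concrete with the cross-site scripting case. This is a sketch of a regression test, assuming a hypothetical render_comment function that applies output encoding with Python's standard html.escape.

```python
import html


def render_comment(comment: str) -> str:
    # Patch: encode user input before placing it inside HTML.
    # render_comment is a hypothetical function for this illustration.
    return "<p>" + html.escape(comment) + "</p>"


def test_script_tag_is_neutralized() -> None:
    # The malicious input is fixed, so the safe output is fixed too.
    payload = "<script>alert(1)</script>"
    rendered = render_comment(payload)
    assert "<script>" not in rendered
    assert rendered == "<p>&lt;script&gt;alert(1)&lt;/script&gt;</p>"
```

The test passes on every run because the same input always yields the same encoded output. That is what makes closing the ticket meaningful.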

This is the entire mental model security engineers have been trained on for thirty years. It works because traditional software is deterministic.

What makes prompt injection different

An LLM is not deterministic in the sense that matters here. The same prompt can produce different outputs. The same context window with the same instruction at position N can produce one output today and a different output after a model update. The model's response is sampled from a probability distribution over tokens. Even at temperature zero, the distribution depends on every token in the context, which means a single character changed three thousand tokens earlier can flip the response.

Prompt injection exploits this property. The attacker does not need a single deterministic input that produces a single deterministic compromise. They need an input that shifts the probability distribution toward a compromised output. The shift can be small. It can be reliable enough to be commercially attractive at a 30 percent success rate. It does not need to work every time.
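A rough illustration of why a partial success rate is enough. The sketch assumes each attempt succeeds independently with the same probability, which is a simplification. Real attempts are not independent, and the attempt count of ten is an arbitrary example.

```python
def p_at_least_one_success(p_per_attempt: float, attempts: int) -> float:
    # Probability that at least one of `attempts` independent tries succeeds.
    # Independence is an assumption made for the illustration only.
    if not 0.0 <= p_per_attempt <= 1.0:
        raise ValueError("p_per_attempt must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_per_attempt) ** attempts


if __name__ == "__main__":
    # A 30 percent per-attempt rate, retried ten times.
    print(round(p_at_least_one_success(0.30, 10), 4))
```

Under those assumptions, an attack that works 30 percent of the time succeeds at least once in ten tries about 97 percent of the time. The attacker pays only for retries.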

This is why every "patch" for prompt injection has failed.

The provider adds a system prompt instructing the model to ignore injection attempts. The attacker writes a prompt that asks the model to ignore the system prompt. The success rate drops, then climbs back when the attacker iterates. The provider adds a guardrail model. The attacker constructs prompts that the guardrail scores as benign. The success rate drops, then climbs back. The provider adds output filtering. The attacker shifts to actions that pass the filter but still cause the harm. The success rate drops, then climbs back.

The pattern is the same in every iteration. A statistical defense produces a statistical reduction. The reduction is real but it is not closure. It cannot be closure. There is no patch that makes the model deterministic, because the model's non-determinism is not a bug in the model. It is the model.

The two-surface decomposition

If prompt injection cannot be patched, the question becomes: what is the right architecture for defense?

The answer is to stop treating the request as the security surface.

Almost every published prompt injection defense scans the request. The instruction the user typed, the documents the agent ingested, the tool descriptions, the system prompt. The defense reads what is going in and tries to predict whether it will cause something bad to come out. This is intent classification. It is hard, it is statistical, and it produces a brutal false positive rate at any sensitivity high enough to catch sophisticated attacks.

We do not enforce against the request. We enforce against the response.

The decomposition is this. The request is intent only. We capture it for context, we use it to inform the statistical model, but we do not enforce against it. The response is the security surface. When the model emits an action, we evaluate that action deterministically against a policy that has nothing to do with the prompt. The action either is or is not within the authorized capability. There is no probability involved at the enforcement layer.

This works because the actions an AI agent can take are a small, finite set. Send an email. Make a payment. Modify a file. Call an API. Each action has a well-defined consequence and a well-defined authorization requirement. The policy that says "this user has not authorized this agent to make payments above $500 without confirmation" is a deterministic rule. It does not care whether the agent's instruction to make the payment was the result of a clean prompt or a poisoned context. It cares whether the action is within authorization. If it is not, the action holds in the Execution Gate until a human confirms.
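The payment rule above can be sketched as a plain function. The type and function names here are hypothetical and are not Vigil's actual interfaces. Amounts are held in integer cents to keep the comparison exact.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    HOLD = "hold"


@dataclass(frozen=True)
class PaymentAction:
    # Hypothetical shape of an action emitted by the agent.
    amount_cents: int


@dataclass(frozen=True)
class PaymentPolicy:
    # Payments above this amount need human confirmation ($500 by default).
    confirm_above_cents: int = 50_000


def evaluate_payment(action: PaymentAction, policy: PaymentPolicy) -> Decision:
    # The prompt is not a parameter. Only the action and the policy are read.
    if action.amount_cents > policy.confirm_above_cents:
        return Decision.HOLD
    return Decision.ALLOW
```

The function signature carries the point. Nothing the attacker writes into the context can reach the comparison, because the context is never passed in.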

This is the only architecturally sound answer to prompt injection. Statistical detection on the request. Deterministic enforcement on the response. The two surfaces are separate. Detection passes a signal to enforcement. It never makes the decision.

Why this also closes the prompt-injection-of-the-defense problem

A defense layer that uses an LLM to evaluate whether an action is safe is a defense layer that can itself be prompt-injected. The attacker's prompt produces a poisoned context for the agent, which emits an action, which is evaluated by the defense LLM, which is now reading attacker-controlled content as part of its evaluation context. If the defense LLM can be talked into approving the action, the defense layer is compromised.

This has happened in production. It will happen again. Every defense product that ships an LLM in the enforcement path is shipping a vulnerability as a feature.

In our architecture, the LLM is in the detection layer only. Detection produces a signal. Signal is a number. Number flows to a deterministic policy engine. Policy engine reads the number and the action and the user's authorized capabilities, applies a rule, and produces a decision. The decision is one of three outcomes. Allow, hold, or block. There is no LLM in the path between the action and the decision.

A motivated attacker can sometimes shift a detection score. They cannot prompt-inject a Boolean comparison.

The false positive math

The reason most security products fail in consumer use is not that they miss attacks. It is that they raise alerts the user cannot evaluate. The user mutes the alerts. Then uninstalls.

Every defender has a number for the maximum tolerable alert rate. For consumer security, the number is roughly three alerts per day. Above three, mute. Above ten, uninstall. The number is empirical. It has been the number for at least a decade.

If your defense is statistical and you scan the request, your expected number of false alerts per day is the number of requests the user makes per day times the false positive probability per request. A user who sends fifty AI requests a day at a one percent false positive rate generates a half-alert per day. At a five percent rate, two and a half. At a ten percent rate, five. Five is past the three-alert mute threshold, so the product is dead at the ten percent rate.
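The arithmetic is simple enough to check in a few lines. The thresholds are the three-per-day and ten-per-day figures from this post. The function names are illustrative.

```python
def expected_false_alerts_per_day(requests_per_day: int, fp_rate: float) -> float:
    # Expected false alerts = daily request volume x per-request false positive rate.
    return requests_per_day * fp_rate


def user_reaction(alerts_per_day: float) -> str:
    # Thresholds from the text: above three alerts, mute. Above ten, uninstall.
    if alerts_per_day > 10:
        return "uninstall"
    if alerts_per_day > 3:
        return "mute"
    return "tolerate"


if __name__ == "__main__":
    for rate in (0.01, 0.05, 0.10):
        alerts = expected_false_alerts_per_day(50, rate)
        print(f"{rate:.0%} false positive rate: {alerts:.1f} alerts/day, {user_reaction(alerts)}")
```

At fifty requests a day, the user tolerates the one and five percent rates and mutes at ten percent. Raising the request volume moves every row toward mute without any change in detector quality.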

If your defense is deterministic and you scan the response, your false positive rate depends on whether the action falls within the policy. If it does, no alert. If it does not, the alert is not a false positive. It is a statement that the action requires confirmation.

The deterministic-on-the-response architecture has a structural false positive advantage that the statistical-on-the-request architecture cannot match. The reason it cannot match is not engineering. It is a limit on information. A prompt does not carry enough information to deduce intent with high reliability. An action checked against a policy gives the same answer every time.

What we built

Vigil's pipeline is the two-surface decomposition. The request goes through statistical detection: four models running in ensemble, with a latency budget under 10ms for tier classification. The signals flow into the Vault for audit and into the policy engine as inputs. The response, when the agent emits an action, goes through the Execution Gate, which evaluates the action against the user's authorized capabilities and the policy rules, deterministically, in under 50ms for the prevention path.

There is no LLM between the action and the decision. There never will be.

The detection models can be wrong. The detection models will be wrong. A false positive in detection raises the tier, which biases the policy toward holding ambiguous actions for confirmation. A false negative in detection means the policy does not get the elevated tier signal. In that case the policy still evaluates the action against capability authorization. If the action is within authorization, it proceeds. If it is not, it holds.
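A sketch of how both error cases play out, under stated assumptions. The Action type, the sensitive flag standing in for "ambiguous", and the decide function are hypothetical and do not describe Vigil's actual policy engine. The block outcome is left out because this post does not say when it applies.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    HOLD = "hold"


@dataclass(frozen=True)
class Action:
    kind: str        # e.g. "send_email" or "make_payment"
    sensitive: bool  # hypothetical flag for actions treated as ambiguous


def decide(action: Action, authorized_kinds: frozenset, elevated_tier: bool) -> Decision:
    # The capability check runs no matter what detection reported,
    # so a detection false negative cannot unlock an unauthorized action.
    if action.kind not in authorized_kinds:
        return Decision.HOLD
    # A raised tier only makes the policy more cautious about
    # authorized actions that are marked sensitive.
    if elevated_tier and action.sensitive:
        return Decision.HOLD
    return Decision.ALLOW
```

With a false negative, an unauthorized payment still holds because the first check does not read the tier. With a false positive, the worst case is an extra confirmation on a sensitive action the user had already authorized.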

The architecture survives detection errors because detection never makes the enforcement decision. It only adjusts how cautious the policy is. That is the property that matters.

The takeaway

Prompt injection cannot be patched. Stop treating it as a vulnerability. Treat it as a statistical property of the system that requires statistical detection feeding deterministic enforcement on a different surface than the one being attacked.

If your defense scans the request, you are doing intent classification on probabilistic input. You will lose this fight, repeatedly, at every sensitivity threshold.

If your defense uses an LLM to decide whether actions are safe, you have shipped an exploitable defense. You will discover this in production.

If your defense decomposes the pipeline two ways, with statistical detection on the request and deterministic enforcement on the response, you have an architecture that survives the attacker iterating against you. The attacker can shift detection scores. They cannot shift the policy engine.

This is the architectural decision that determines whether the category of products you are building has a future. The wrong frame produces a security patch arms race that defenders eventually lose. The right frame produces infrastructure.

We are building infrastructure.
