Threat Research

DeepMind's six attack categories, mapped to Vigil modes.

On April 1, 2026, researchers at Google DeepMind published AI Agent Traps, a systematic taxonomy of attacks against autonomous AI agents. The paper organizes attacks by which part of an agent's operating cycle they target: perception, reasoning, memory, action, multi-agent coordination, and the human supervisor. Six categories. Each one with documented proof-of-concept attacks and, in several cases, alarming success rates against deployed systems.

The taxonomy is the cleanest framing of agent-side attacks I have read. It is also a useful test for any defense product: if you cannot explain how your architecture handles each of the six categories, you do not have a defense product. You have a marketing surface.

This post is that test, applied to Vigil. I am going to be honest about which categories we handle cleanly, which we handle partially, and which we do not yet handle at all. The point of the exercise is not to claim coverage. It is to map the work that remains.

Three of the six categories are clean wins for our architecture. Two are partial. One is open. Anyone selling you full coverage of all six today is selling you marketing.

The six categories, briefly

Before the mapping, the categories themselves, paraphrased from the paper.

Content injection (perception). Hidden HTML, CSS, accessibility metadata, or invisible text in web pages, emails, and documents that the human cannot see but the AI agent reads as instructions. Reported success rates on benchmarks like WASP run as high as 86 percent against unprotected agents.

Semantic manipulation (reasoning). Adversarial prompts crafted to exploit known weaknesses in the agent's reasoning patterns, including misleading framing, false context, and structured prompts that trigger known model failure modes.

Cognitive state (memory). Poisoning the agent's long-running context, particularly retrieval-augmented generation (RAG) databases, by inserting documents that bias the agent's responses for specific queries. The paper cites research showing that contaminating less than 0.1 percent of a knowledge base can produce attack success rates above 80 percent.

Behavioral control (action). Direct manipulation of what the agent does, including the M365 Copilot case where a single crafted email caused the agent to dump its full privileged context. Action attacks bypass detection that only watches the request side.

Systemic (multi-agent). Attacks against systems where multiple agents coordinate, including cascade failures where a manipulated input to one agent propagates to others. Sub-agent spawning attacks, where an orchestrator agent is tricked into launching a sub-agent with a poisoned system prompt, are reported with success rates between 58 and 90 percent.

Human-in-the-loop (supervisor). Attacks that exploit the human approval step, either by overwhelming the user with false confirmations until they auto-approve, or by socially engineering the user through agent-mediated channels.

These six are a complete framing of the agent-side attack surface as of early 2026. New categories may emerge. The existing six are not going away.

The mapping

Category one: content injection. Clean win.

Vigil's architecture handles content injection by structure, not by detection sensitivity. The reason is the two-surface decomposition I have written about in the prompt injection post. The request side is statistical; the response side is deterministic.

A content injection attack works by smuggling an instruction into the agent's context. The instruction tells the agent to do something the user did not authorize: send credentials, change a system prompt, exfiltrate data, take an action against a third-party counterparty. The injection itself is not the attack. The attack is the action that follows from the injection.

Vigil evaluates the action. The action is structured. The structured action either is or is not within the user's authorized capability set. The injection's success at influencing the model's reasoning does not change whether the resulting action is allowed by the policy. If the injection succeeds in making the agent want to do something unauthorized, the Execution Gate holds the action regardless of why the agent wanted to do it.

This is the architectural property the two-surface design produces. Detection on the request can also catch many content injection attempts (Isolation Forest scores climb on encoded payloads, LSTM scores climb on suspicious sequences), which raises the tier and biases the policy toward holding ambiguous actions. But the deterministic enforcement on the response side is the layer that closes the attack regardless of detection accuracy.

This is a clean win.

Category two: semantic manipulation. Clean win.

Semantic manipulation targets the agent's reasoning. It works by constructing prompts that exploit known model weaknesses to produce outputs the user did not intend. Examples include misleading context that biases a financial recommendation, structured prompts that bypass guardrails, and adversarial framing that produces dangerous outputs the model would otherwise refuse.

The attack vector ends in an action. The user's agent, having been semantically manipulated, does something on the user's behalf. Vigil evaluates that action against the policy. The semantic manipulation does not change the policy. The action either is or is not authorized.

This is the same architectural answer as content injection, applied to a different attack class. The detection ensemble may or may not catch the semantic manipulation in the request side; the deterministic policy on the response side does not depend on the detection succeeding. The architecture handles both content-level and reasoning-level attacks because both terminate in actions, and actions are what the Gate evaluates.

This is also a clean win, for the same structural reason.

Category three: cognitive state. Partial.

Memory poisoning is harder. The attack works by modifying the agent's persistent context, including RAG databases, vector stores, and long-running session memory. The contamination can be small (under 0.1 percent of the corpus, per the cited research) and produce large effects (over 80 percent attack success on specific queries).

Vigil's architecture handles memory poisoning at the action layer the same way it handles content injection: actions resulting from poisoned context still go through the Execution Gate, and unauthorized actions still hold or block. This produces meaningful coverage.

Where the coverage is partial is in the prevention of the poisoning itself. The current product is focused on the agent's outbound traffic. Modifications to the agent's internal memory structures are upstream of where the proxy sits. The Memory Integrity primitive in v2's defense layer addresses some of this, by tracking ingested document provenance and flagging context shifts that follow ingestion of unverified sources. The work is incomplete. Phase 2 of the defense layer adds more primitives. Phase 3 fills the rest.

I am calling this partial because action-layer enforcement catches the consequences but does not prevent the cognitive state from being poisoned in the first place. A defense product that fully handled this category would be intercepting the agent's memory operations as well as its actions. We are working toward this. We are not there yet.

Category four: behavioral control. Clean win.

Behavioral control attacks are the case the architecture was designed for. The cited M365 Copilot example, where a single crafted email caused the agent to bypass classifiers and dump its privileged context, is exactly the failure mode the two-surface architecture closes. The action (dumping the privileged context) is structured. The action is not within the user's authorization for the agent. The Execution Gate holds.

This category is the bullseye for what Vigil does. If we cannot defend against this, we have nothing. We can. Clean win.

Category five: systemic / multi-agent. Partial.

Multi-agent attacks are where the architecture starts to strain.

For agent-to-agent delegation within a single user's authority chain, TAP handles the case structurally. Each delegation is a signed attestation. The chain is verifiable. The Execution Gate evaluates actions against the principal's authorized capabilities, not against the immediate caller. An orchestrator that has been tricked into spawning a poisoned sub-agent cannot grant the sub-agent capabilities the original principal did not authorize.

For multi-agent attacks across separate users (the cited example of a fake financial report triggering synchronized sell-offs across trading agents), the architecture is incomplete. Each user's agent operates with their own attestation chain. There is no cross-user coordination layer that detects emergent behavior across many agents responding to the same poisoned input.

The cross-user case is partly a product question (the user's defense layer should detect that a known false signal has triggered an unusual cascade of actions) and partly a coordination question (independent verification of source documents, before they trigger automated actions, is structurally outside Vigil's per-user scope). We have ideas on both. They are not shipped.

This is partial. Single-principal multi-agent: covered. Cross-principal multi-agent: open.

Category six: human-in-the-loop. Open.

This is the hardest category and the one I am least satisfied with our current coverage on.

The attack works by exploiting the human approval step. Two failure modes:

The first is alert fatigue. If the user is presented with too many confirmation prompts, they auto-approve. An attacker can deliberately trigger many low-stakes confirmations to condition the user into auto-approval, then slip the high-stakes action through the same workflow. We address this with the under-three-alerts-per-day target enforced by the false positive math in the four-model ensemble, but the target is statistical, not architectural. A motivated attacker who understands the user's threshold can still attempt to flood it.

The second is social engineering through agent-mediated channels. The agent receives a request that looks legitimate, presents it to the user with framing that nudges toward approval, and the user approves a harmful action because the framing was persuasive. This is not a prompt injection in the request to the agent; it is an exploitation of the trust the user places in the agent's framing of approval requests.

We do not have a clean architectural answer for the second failure mode. Our current approach is to display the structured action to the user (not the agent's natural-language framing) at the moment of confirmation, so that the user is approving "send $5000 to account X" rather than "approve the routine payment." This raises the bar but does not close the attack.

This is an open category. We are working on it. We are not claiming coverage we do not have.

Summary

Three clean wins (content injection, semantic manipulation, behavioral control) because they all terminate in actions and the action layer is deterministic.

Two partials (cognitive state, systemic) because the action-layer defense catches consequences but does not prevent the upstream attack. We have shipped components and are building the rest.

One open (human-in-the-loop) because the social engineering surface is real and not yet structurally addressable in our architecture.

If you are evaluating a defense product against the DeepMind taxonomy, ask each vendor to do this exercise honestly. The vendor that claims coverage of all six is not being honest. The vendor that maps each category to architectural primitives, with partial-coverage cases marked clearly, is the one whose product you can actually deploy.

The DeepMind taxonomy is the right framing. The honest answer to the framing is that some categories yield to architecture, some yield to product work in progress, and some yield to nothing yet. We will keep mapping our progress against the six.

← Back to The Vigil Journal