Engineering

The four-model detection ensemble, explained.

We get asked, often, why our detection layer is four models instead of one. The implied question is whether four is the result of engineering insecurity, where every team adds a model and nobody removes one. The answer is no. Each of the four catches a class of attack the other three miss. Removing any one of them produces a measurable hole. The ensemble is the minimum that covers the threat surface we have studied.

This post explains each model, what it catches, and what happens at the seams.

An ensemble is not a stack of redundant detectors. It is a set of orthogonal detectors aimed at non-overlapping attack classes.

The threat surface, divided four ways

Attacks against AI agents fall into four families that differ in how they look in the data.

Single-request anomalies. A request, in isolation, that is statistically unusual relative to the population of requests we have seen. Classic examples include obfuscated injection payloads, encoded instructions, and prompts that combine tokens in patterns rare across the training distribution.

Sequential drift. A session whose individual requests are each unremarkable but whose sequence of requests, read as a trajectory, indicates the agent's context is being progressively shifted. Almost every successful agent compromise we have studied since January 2026 is sequential drift, not single-request injection.

User-baseline deviation. A request, in isolation, that is statistically normal for the population but statistically abnormal for this specific user. Account compromise looks like this. So does delegation to a sub-agent that has different behavioral characteristics than the principal.

Regime shift. A change-point in the session's behavioral distribution. The session was operating in regime A, then transitioned to regime B. The transition is the signal. Hijacking events, mid-session tool injection, and context-poisoning all produce regime shifts.

Each family requires a different statistical instrument. We use four.

Model 1: Isolation Forest, for single-request anomalies

Isolation Forest builds an ensemble of random trees that partition the feature space. Anomalies are the points that get isolated quickly, in few splits, because they are far from the dense regions of the distribution. The algorithm is fast, parameter-light, and handles high-dimensional input cleanly.
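
To make the isolation idea concrete, here is a minimal pure-Python sketch of the textbook algorithm, not the production model. The function names, tree count, and sample size are illustrative assumptions. The score follows the standard formulation: a shorter average path length across trees gives a score closer to 1.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful search in a BST of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively partition points with random axis-aligned splits."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split on this feature
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }

def path_length(tree, x, depth=0):
    """Number of splits needed to isolate x, adjusted for unsplit leaves."""
    if "size" in tree:
        return depth + c_factor(tree["size"])
    branch = "left" if x[tree["dim"]] < tree["split"] else "right"
    return path_length(tree[branch], x, depth + 1)

def fit_forest(points, n_trees=100, sample_size=64, seed=0):
    """Train on benign feature vectors only (hypothetical defaults)."""
    if len(points) < 2:
        raise ValueError("need at least two training points")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    return trees, sample_size

def anomaly_score(forest, x):
    """Score in (0, 1]; higher means x was isolated in fewer splits."""
    trees, sample_size = forest
    mean_path = sum(path_length(t, x) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c_factor(sample_size))
```

A point far from the dense region of the training data should score higher than a point in the middle of it, which is the property the front line relies on.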

We use it on per-request feature vectors that include token-level statistics, instruction-pattern features, encoding indicators, and structural properties of the prompt. The model is trained on a corpus of benign requests and evaluates each new request against that distribution.

The Isolation Forest is the front line. Latency budget is sub-millisecond. It catches the obvious cases: the prompts that no normal user generates, the encodings that scream attempted bypass, the structural patterns that match published injection payloads. It also catches the long tail of unusual-but-benign requests, which is why its score is a signal, not a verdict.

A single-model defense with only Isolation Forest would fail because most successful attacks do not produce single-request anomalies. They produce sequences.

Model 2: LSTM, for sequential drift

LSTMs (long short-term memory networks) were the workhorse of sequential modeling for a decade before transformers replaced them in language modeling. They are still excellent at exactly what they were built for: detecting whether a sequence is moving in a way that the sequences in the training distribution did not.

We use the LSTM on session-windowed sequences of request features. The model reads the trajectory of a session and emits a score that estimates how typical the trajectory is. Drift attacks, where the agent's context is gradually poisoned across multiple turns, produce trajectories that the LSTM has never seen. The score climbs.
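
As a rough illustration of what "session-windowed" can mean, the sketch below slices a session's per-request feature vectors into overlapping fixed-length windows, which is a common way to feed a sequence model. The window length and stride are hypothetical values, not the ones used in the system.

```python
def session_windows(feature_vectors, window=8, stride=1):
    """Return overlapping windows over one session's request features.

    Each window is a list of `window` consecutive feature vectors.
    Sessions shorter than `window` yield no windows.
    """
    last_start = len(feature_vectors) - window
    return [
        feature_vectors[start:start + window]
        for start in range(0, last_start + 1, stride)
    ]
```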

The model needs roughly 222 observations to warm. That is not a tuning hyperparameter. It is the empirical number at which the false-positive curve flattens for a new user. Below the warm threshold, the model emits a low-confidence prior and the policy engine treats the score with appropriate skepticism. Above it, the model is reliable enough to trigger tier escalation.
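
The warm-up behavior can be sketched as a small gate in front of the raw score. This is an illustration under assumptions: only the 222-observation threshold comes from the text above, while the names, the cold prior value, and the escalation threshold are hypothetical.

```python
from dataclasses import dataclass

WARM_OBSERVATIONS = 222  # empirical warm threshold quoted in the post

@dataclass(frozen=True)
class GatedScore:
    score: float        # score handed to the policy engine
    warmed: bool        # False while the model is below the warm threshold
    can_escalate: bool  # whether this score may trigger tier escalation

def gate_lstm_score(raw_score, observations_seen,
                    cold_prior=0.1, escalation_threshold=0.8):
    """Emit a low-confidence prior while cold, the raw score once warmed.

    cold_prior and escalation_threshold are hypothetical values.
    """
    if observations_seen < WARM_OBSERVATIONS:
        return GatedScore(score=cold_prior, warmed=False, can_escalate=False)
    return GatedScore(
        score=raw_score,
        warmed=True,
        can_escalate=raw_score >= escalation_threshold,
    )
```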

A single-model defense with only LSTM would fail on the cold-start problem. New users would generate alerts that exhaust the alert budget before the model could adapt.

Model 3: Bayesian, for user-baseline deviation

The Isolation Forest and the LSTM are population models. They learn what is normal across all users. A Bayesian model, parameterized per user, learns what is normal for each user.

We use a hierarchical Bayesian model with a population prior and per-user posterior updates. New users start at the population prior; their posterior tightens around their behavior as they generate observations. Like the LSTM, the Bayesian model needs roughly 222 observations to converge usefully.
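
One simple way to picture a population prior with per-user posterior updates is the conjugate Normal-Normal case on a single scalar feature. The post does not specify the likelihood or the features, so everything below (the Gaussian assumption, the known observation variance, the names) is an illustrative assumption rather than the production model.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class UserPosterior:
    mean: float  # current belief about the user's typical feature value
    var: float   # uncertainty about that mean
    n: int = 0   # observations absorbed so far

def new_user(pop_mean, pop_var):
    """A new user starts at the population prior."""
    return UserPosterior(mean=pop_mean, var=pop_var, n=0)

def update(post, x, obs_var):
    """Conjugate Normal-Normal update with known observation variance."""
    precision = 1.0 / post.var + 1.0 / obs_var
    mean = (post.mean / post.var + x / obs_var) / precision
    return UserPosterior(mean=mean, var=1.0 / precision, n=post.n + 1)

def surprise(post, x, obs_var):
    """Absolute z-score of x under the posterior predictive distribution."""
    predictive_sd = math.sqrt(post.var + obs_var)
    return abs(x - post.mean) / predictive_sd
```

In this toy setup, a value that sits at the population mean is unsurprising for a new user but becomes highly surprising once the posterior has tightened around a user whose own behavior sits elsewhere.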

The job of the Bayesian model is to catch deviation from the user's own baseline that the population models miss. A request that is normal for the population but unusual for this specific user is the signature of account compromise, of delegation to a sub-agent with different characteristics, or of a known user being driven by social engineering toward behavior they would not normally exhibit.

The Bayesian model also drives the false-positive math. A request that is normal for both the population and the user produces a low score from all four models, which produces no alert. The deterministic policy still evaluates the action, but no alert is raised to the user. This is how we hold the under-three-alerts-per-day target.

A single-model defense with only the Bayesian would fail because the per-user model takes too long to train on novel attacks. The population models cover the gap.

Model 4: Multi-Window CUSUM, for regime shifts

CUSUM (cumulative sum control chart) is a change-point detection algorithm from sequential analysis. It tracks a running statistic and flags when the statistic crosses a threshold that indicates the underlying distribution has shifted. Multi-window CUSUM runs the test at multiple time scales simultaneously: a short window for fast hijack detection, a medium window for context drift, a long window for slow poisoning.
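
A minimal sketch of the multi-scale idea follows, assuming each window runs a two-sided tabular CUSUM recursion over its trailing observations. The input is assumed to be a standardized deviation from the session baseline. The window lengths, slack values, and thresholds are hypothetical, chosen only so the short window reacts to abrupt shifts and the long window accumulates small sustained ones.

```python
from collections import deque

class WindowedCusum:
    """Two-sided CUSUM recomputed over a trailing window of observations."""

    def __init__(self, window, slack, threshold):
        self.residuals = deque(maxlen=window)
        self.slack = slack          # drift tolerated before accumulating
        self.threshold = threshold  # alarm level for the statistic
        self.statistic = 0.0

    def update(self, z):
        """Add one standardized observation; return True on alarm."""
        self.residuals.append(z)
        up = down = 0.0
        for r in self.residuals:
            up = max(0.0, up + r - self.slack)
            down = max(0.0, down - r - self.slack)
        self.statistic = max(up, down)
        return self.statistic > self.threshold

def make_detectors():
    """Three time scales with hypothetical parameters."""
    return {
        "short": WindowedCusum(window=5, slack=1.0, threshold=4.0),
        "medium": WindowedCusum(window=20, slack=0.5, threshold=5.0),
        "long": WindowedCusum(window=100, slack=0.25, threshold=6.0),
    }

def scan(zs, detectors):
    """Return the first alarm index per window, for windows that alarmed."""
    first_alarm = {}
    for t, z in enumerate(zs):
        for name, detector in detectors.items():
            if detector.update(z) and name not in first_alarm:
                first_alarm[name] = t
    return first_alarm
```

With these parameters, a large abrupt shift trips the short window within two observations, while a small sustained shift never trips the short window and is only caught once the long window has accumulated enough evidence.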

The CUSUM is what catches the moment a session changes character. The user was using the agent for code review; now the agent is suddenly transferring credentials. The session was operating in distribution A; the session has moved to distribution B. The transition is the signal.

CUSUM is not redundant with the LSTM. The LSTM detects that a trajectory is unusual relative to the training distribution. CUSUM detects that a trajectory's distribution has changed within a session. These are different questions. A session can be entirely typical against the population and still exhibit a change-point. A session can be unusual against the population and not exhibit a change-point.

A single-model defense with only CUSUM would fail because change-point detection requires a baseline. The other three models supply that baseline.

How the four interact

The four scores feed into the policy engine, which combines them according to a deterministic rule rather than a learned aggregator. We do not use a meta-classifier on top of the four because a meta-classifier is one more learned model in the path, and the architecture is designed to keep statistical models out of the enforcement layer.

The combination rule is approximately this. Each model produces a score in a normalized range. The policy engine assigns a tier based on the highest score across the ensemble, modulated by the agreement among models. Three models scoring high while one scores low produces a different tier than four models scoring high. The tier feeds the Execution Gate, which determines whether actions hold pre-execution or proceed.

Disagreement among models is informative. The Isolation Forest scoring high while the LSTM scores low is a signal that the request is structurally unusual but the trajectory is normal. Probable noise, not attack. The LSTM scoring high while the Isolation Forest scores low is a signal that the requests are individually normal but the sequence is drifting. Probable drift attack. The deterministic policy reads the disagreement pattern and routes accordingly.
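
The shape of such a rule can be sketched in a few lines. This is an illustration only: the cutoff for "high", the tier numbering, and the route labels are hypothetical, and the real rule is described above only approximately.

```python
from dataclasses import dataclass

HIGH = 0.7  # hypothetical cutoff for a "high" normalized score

@dataclass(frozen=True)
class Scores:
    isolation_forest: float
    lstm: float
    bayesian: float
    cusum: float

def assign_tier(s):
    """Tier from the highest score, modulated by how many models agree."""
    values = (s.isolation_forest, s.lstm, s.bayesian, s.cusum)
    if max(values) < HIGH:
        return 0                      # nothing high: no escalation
    agreeing = sum(v >= HIGH for v in values)
    if agreeing == 4:
        return 3                      # unanimous
    if agreeing >= 2:
        return 2                      # partial agreement
    return 1                          # a single model is high

def read_disagreement(s):
    """Interpret the Isolation Forest / LSTM disagreement pattern."""
    if s.isolation_forest >= HIGH and s.lstm < HIGH:
        return "probable_noise"       # odd request, normal trajectory
    if s.lstm >= HIGH and s.isolation_forest < HIGH:
        return "probable_drift"       # normal requests, drifting sequence
    return "no_disagreement"
```

Because the function is a fixed mapping from four floats to a tier, the same scores always produce the same decision, which is the property a learned aggregator would give up.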

What happens at the seams

The interesting failures of an ensemble are at the seams between the models. We have spent more time studying these than studying any individual model.

The seam between the Isolation Forest and the LSTM is the boundary between request-level and sequence-level analysis. An attack that distributes a payload across multiple requests, each individually unremarkable, is in this seam. The LSTM is supposed to catch it. It does, after enough observations. Before warming, the seam is exposed, and the policy engine compensates by running stricter capability bounds on cold sessions.

The seam between the LSTM and the Bayesian is the boundary between population trajectory and user-specific trajectory. A trajectory that is normal for the population but unusual for the user sits in this seam. The Bayesian catches this once warmed. Before warming, the seam is exposed, and the policy engine compensates by treating cold-user actions in the Bayesian's pre-warm window with the same caution as cold-session actions in the LSTM's pre-warm window.

The seam between the CUSUM and everything else is the boundary between within-session change-point and across-session distribution. The CUSUM does not need warming because it operates within a session, but it does require a session of sufficient length to detect a change point. Very short sessions are exposed, and the policy engine compensates by treating actions in the first three or four turns of a session with elevated caution.

The compensation pattern is consistent. Whichever model is in its blind window, the deterministic policy raises the bar on the actions it allows without confirmation. The architecture stays sound during model warm-up because the policy engine is not relying on detection alone.
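
The compensation pattern can be pictured as a confirmation bar that drops whenever any model is in its blind window. The sketch below assumes a scalar action-risk value and two hypothetical bar levels. The 222-observation threshold comes from the model sections above, and the turn limit takes the upper end of the "three or four turns" range.

```python
WARM_OBSERVATIONS = 222   # warm threshold quoted for the LSTM and Bayesian
SHORT_SESSION_TURNS = 4   # assumed upper end of "three or four turns"

def requires_confirmation(action_risk, lstm_observations, bayes_observations,
                          session_turn, base_bar=0.6, cold_bar=0.3):
    """Return True if the action should hold for confirmation.

    base_bar and cold_bar are hypothetical risk levels; a lower bar means
    more actions require confirmation.
    """
    in_blind_window = (
        lstm_observations < WARM_OBSERVATIONS
        or bayes_observations < WARM_OBSERVATIONS
        or session_turn <= SHORT_SESSION_TURNS
    )
    bar = cold_bar if in_blind_window else base_bar
    return action_risk >= bar
```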

Latency

Tier classification, the combined evaluation across all four models, fits in under 10 ms. The Isolation Forest runs in under a millisecond. The LSTM and Bayesian both fit in single-digit milliseconds when warmed. The CUSUM is essentially free. The aggregation is a deterministic function on four floats.

If we miss the latency budget, we lose the user. The product becomes friction instead of infrastructure, and friction gets uninstalled. The ensemble was designed inside the budget, not optimized down to it.

The takeaway

Single-model defenses fail because attacks are not single-class. The threat surface decomposes into four families. Each family needs an instrument that fits it. The ensemble is the minimum coverage. The interaction between the four is what produces the false-positive rate that holds under three alerts per day.

If a defense product is shipping with one detection model, ask which family it covers and what the seams look like. The answer determines whether you are evaluating a product or a demo.

We are happy to walk through the seams in detail with anyone evaluating the system. The ensemble is in the source. The thresholds are documented. The math is reproducible.

That is the bar.
