The latency budget is the product.

Vigil Engineering·Apr 9, 2026·8 min read

A claim. The hardest engineering constraint on a consumer AI defense product is not detection accuracy, not the false positive rate, and not the threat model coverage. It is the latency the product adds to AI requests.

The user notices latency. They notice it in increments of a hundred milliseconds. At a hundred milliseconds, they suspect the network is slow. At three hundred, they wonder if something is broken. At five hundred or more, they look for a setting to disable whatever is in the way. Most enterprise defense products live at three to five hundred milliseconds of added latency, and they survive in enterprise contexts because the IT team installed them and the user cannot uninstall them.

A consumer defense product does not have this protection. The user can disable it. The user will disable it if the product makes their AI feel slow. Once disabled, the product produces zero security value, regardless of how good its detection ensemble is.

This is the existential constraint. Every architectural decision in Vigil flows from it.

The product is the silence of the proxy. Every millisecond the proxy adds is a millisecond closer to the user disabling it. The latency budget is not a feature requirement. It is whether the product exists at all.

This post is about the budget, what we measure against it, and the engineering trade-offs we made to live inside it.

The targets, in detail

The numbers we publish, with the methodology behind them.

Tier classification: under 10ms p99. Every request that crosses the proxy is classified into one of five tiers, based on the combined output of the four-model detection ensemble and the user's policy state. The classification has to complete before the proxy decides whether to forward the request unchanged, hold it, or block it. Sub-10ms is the budget. We hit it consistently on Apple Silicon and on Intel Macs.

Pass-through path: under 1ms typical. When tier classification produces a tier that does not require any further evaluation, the proxy forwards the request immediately. The added latency is the overhead of TLS termination, classification, and re-encryption. Sub-1ms is achievable when the policy rule cache hits.

Prevention path: under 50ms p99. When tier classification produces a tier that requires holding the action for evaluation, the policy plane runs the deterministic decision function, the Vault plane writes the audit chain entry, and the user-facing notification is dispatched. All of this happens in under 50ms, leaving the user's perceived latency dominated by their own response time, not by Vigil's processing.

Tier-3 watchdog: under 500ms. The watchdog fires when the detection ensemble produces high-agreement signals across multiple models on a session that has not previously had an elevated tier. It is a slower path because it triggers a deeper analysis: cross-session pattern matching, capability set re-evaluation, and reconciliation against the user's authority chain. Sub-500ms is the budget. The watchdog is rare, and its latency is mostly invisible: it runs in parallel with action handling, so the user does not block on it for routine actions.

These four numbers are the latency surface of the product. Every architectural decision either fits inside them or gets rejected.
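The four budgets above can be written down as a small table that a benchmark harness checks measurements against. This is a minimal sketch, not Vigil's code: the names `Path`, `budget`, and `within_budget` are hypothetical, and only the thresholds come from this post.

```rust
use std::time::Duration;

/// The four latency paths named in this post. The type and function
/// names are illustrative assumptions, not Vigil's actual API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Path {
    TierClassification,
    PassThrough,
    Prevention,
    Tier3Watchdog,
}

/// Published budget for each path. The post states the pass-through
/// figure as "typical" and the classification and prevention figures
/// as p99; this table only stores the thresholds.
pub fn budget(path: Path) -> Duration {
    match path {
        Path::TierClassification => Duration::from_millis(10),
        Path::PassThrough => Duration::from_millis(1),
        Path::Prevention => Duration::from_millis(50),
        Path::Tier3Watchdog => Duration::from_millis(500),
    }
}

/// True when a measured latency is strictly under the path's budget.
pub fn within_budget(path: Path, measured: Duration) -> bool {
    measured < budget(path)
}
```

A harness built this way fails a build when, say, a measured prevention-path p99 reaches 50ms, which turns the budget into a gate rather than a goal.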

What we gave up to hit the numbers

The tradeoffs are real. Three of the largest.

No remote inference. A defense layer that calls a cloud-hosted model for any classification step pays a full network round trip (request out, response back) plus inference time. Even in the best case, that is fifty to one hundred milliseconds added per request, which destroys the budget. The detection ensemble runs entirely on the user's device. We cannot use larger models than fit on the device. We cannot use models that require GPU acceleration we do not have. These are real product constraints. We accept them because the alternative architecture does not survive the budget.
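The arithmetic behind that rejection is short. The sketch below uses hypothetical function names, and the round-trip and inference figures in the usage note are assumed example values, not measurements.

```rust
use std::time::Duration;

/// Added latency of one remote classification call: one network
/// round trip plus the time the remote model spends on inference.
pub fn remote_call_cost(round_trip: Duration, inference: Duration) -> Duration {
    round_trip + inference
}

/// How many times over a budget a given cost is (1.0 means exactly on budget).
pub fn budget_multiple(cost: Duration, budget: Duration) -> f64 {
    cost.as_secs_f64() / budget.as_secs_f64()
}
```

With an assumed 40ms round trip and 20ms of remote inference, the call costs 60ms, roughly six times the 10ms tier classification budget before any local work has run.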

No synchronous external lookups in the policy path. The policy engine evaluates against the user's authorized capability set. The capability set is loaded into memory at startup and updated on attestation events. The evaluation does not call out to a remote authority during the request. If we had to call out, even to a fast remote authority, the budget would not hold. The capability state has to be local. We accept the engineering cost of keeping it local and synchronized.
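One way to picture a local capability set is a shared in-memory set that the request path only reads and attestation events occasionally write. The sketch below assumes that shape; `CapabilitySet`, its methods, and the capability strings are hypothetical, and the post does not say how Vigil stores or synchronizes the set.

```rust
use std::collections::HashSet;
use std::sync::RwLock;

/// In-memory capability set (hypothetical shape). Reads happen on the
/// request path; writes happen only when an attestation event arrives.
pub struct CapabilitySet {
    allowed: RwLock<HashSet<String>>,
}

impl CapabilitySet {
    /// Load the full set once, at startup.
    pub fn load(initial: &[&str]) -> Self {
        let set: HashSet<String> = initial.iter().map(|s| s.to_string()).collect();
        CapabilitySet { allowed: RwLock::new(set) }
    }

    /// Request-path check: a local read under a shared lock, no network call.
    pub fn is_authorized(&self, capability: &str) -> bool {
        self.allowed.read().unwrap().contains(capability)
    }

    /// Applied when an attestation event grants a capability.
    pub fn grant(&self, capability: &str) {
        self.allowed.write().unwrap().insert(capability.to_string());
    }

    /// Applied when an attestation event revokes a capability.
    pub fn revoke(&self, capability: &str) {
        self.allowed.write().unwrap().remove(capability);
    }
}
```

The read lock lets many request threads check capabilities concurrently; only the rare attestation update takes the exclusive lock.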

Detection model size constrained by inference time. The Isolation Forest, the LSTM, the Bayesian model, and the CUSUM all have to produce inference results inside the tier classification budget. We could use larger models that produce slightly more accurate results. We do not, because slightly more accurate at twice the inference time produces a worse product. The model sizes are tuned to the budget, not to a research benchmark.

These tradeoffs are not regrettable. They are the architecture. A team that does not make these tradeoffs ends up with a product that adds more latency than the user can absorb, regardless of how good the detection is.

The engineering work, briefly

Hitting the budget required specific engineering choices. A few that matter.

Rust as the implementation language for the proxy and the analysis pipeline. Memory safety is a security property; getting it without a garbage collector is a latency property. A garbage-collected runtime would introduce pause times that, even when small, can spike to multiples of the budget under load. Rust's deterministic memory behavior is a hard requirement for hitting the prevention-path budget consistently.

The event bus is in-process. The five planes communicate through a bus, but the bus is a shared-memory channel within a single process. It is not an inter-process protocol, not a network protocol, and not a database round-trip. The bus dispatch is essentially free in latency terms. We made the planes architecturally separate without making them physically separate, which is the correct trade-off when latency is the dominant constraint.
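An in-process bus of this kind can be sketched with a standard-library channel between two threads in one process. This is an illustration under assumptions: the `Classified` event and `run_bus` function are hypothetical, and the post does not say which channel implementation Vigil uses.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical event published by one plane and consumed by another.
pub struct Classified {
    pub request_id: u64,
    pub tier: u8,
}

/// Starts a consumer on its own thread, sends it events over an
/// in-process channel, and returns what the consumer observed, in order.
pub fn run_bus(events: Vec<Classified>) -> Vec<(u64, u8)> {
    let (tx, rx) = mpsc::channel::<Classified>();

    // The consumer drains the channel until every sender is dropped.
    let consumer = thread::spawn(move || {
        rx.iter().map(|e| (e.request_id, e.tier)).collect::<Vec<_>>()
    });

    for event in events {
        tx.send(event).expect("receiver is alive");
    }
    drop(tx); // closing the channel lets the consumer finish

    consumer.join().expect("consumer thread panicked")
}
```

Events move between threads by ownership transfer inside one address space; nothing is serialized and nothing crosses a socket.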

The policy rule set compiles to a static dispatch table at startup. The rule evaluation is not interpreted; it is a chain of compiled decisions. The Execution Gate's decision function is, in the common case, a sequence of branch instructions that complete in nanoseconds. The five-hundred-line implementation is small precisely because it does not need runtime flexibility; the rules are configured out-of-band and compiled at load time.
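A load-time dispatch table can be as simple as an array indexed by tier. The sketch below assumes that shape: `Rule`, `DispatchTable`, and the fail-closed default are illustrative choices, and only the three outcomes (forward, hold, block) and the five tiers come from this post.

```rust
/// The three outcomes the proxy can choose between.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Forward,
    Hold,
    Block,
}

/// The post describes five tiers.
pub const TIER_COUNT: usize = 5;

/// A rule as it might appear in configuration (hypothetical shape).
pub struct Rule {
    pub tier: usize,
    pub decision: Decision,
}

/// Table built once at load time; evaluation is a single indexed read.
pub struct DispatchTable([Decision; TIER_COUNT]);

impl DispatchTable {
    /// Compile rules into the table. Tiers without a rule default to
    /// Block (an assumed fail-closed policy); later rules override earlier ones.
    pub fn compile(rules: &[Rule]) -> Result<Self, String> {
        let mut table = [Decision::Block; TIER_COUNT];
        for rule in rules {
            if rule.tier >= TIER_COUNT {
                return Err(format!("tier {} out of range", rule.tier));
            }
            table[rule.tier] = rule.decision;
        }
        Ok(DispatchTable(table))
    }

    /// Request-path lookup. Out-of-range tiers also fail closed.
    pub fn decide(&self, tier: usize) -> Decision {
        self.0.get(tier).copied().unwrap_or(Decision::Block)
    }
}
```

All validation happens in `compile`, so a malformed rule set is rejected at startup and the request path never parses or interprets anything.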

The Vault chain write is parallel to the action emit. When the policy plane decides to forward an action, the action is emitted immediately, and the Vault chain entry is written in parallel. The user does not wait for the disk write. We chose to make the chain write asynchronous because the alternative blocks the action path on disk I/O, which is unpredictable in the worst case.
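The pattern described here, emit first and write the audit entry off the action path, can be sketched with a background writer thread fed by a channel. `VaultWriter` and `forward_action` are hypothetical names, and the in-memory `Vec` stands in for the on-disk chain.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical audit writer: a background thread owns the chain,
/// and the action path only enqueues entries.
pub struct VaultWriter {
    tx: Sender<String>,
    worker: JoinHandle<Vec<String>>,
}

impl VaultWriter {
    pub fn start() -> Self {
        let (tx, rx) = mpsc::channel::<String>();
        let worker = thread::spawn(move || {
            let mut chain = Vec::new();
            for entry in rx {
                // A real writer would append to disk here.
                chain.push(entry);
            }
            chain
        });
        VaultWriter { tx, worker }
    }

    /// Enqueue an entry and return without waiting for the write.
    pub fn record(&self, entry: String) {
        self.tx.send(entry).expect("vault writer thread is alive");
    }

    /// Close the queue, wait for the writer to drain it, return the chain.
    pub fn finish(self) -> Vec<String> {
        let VaultWriter { tx, worker } = self;
        drop(tx);
        worker.join().expect("vault writer thread panicked")
    }
}

/// Action path: queue the audit entry, then emit the action immediately.
pub fn forward_action(vault: &VaultWriter, action: &str) -> String {
    vault.record(format!("forwarded: {}", action));
    action.to_string()
}
```

The sketch does not address what happens to queued entries if the process exits before the writer drains them; any design of this kind has to answer that separately.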

The detection models load eagerly at startup. Cold-start inference time would blow the budget. The models are loaded into memory once, at startup, and stay there. This costs us memory footprint, which we accept; the alternative is unacceptable latency on the first request after a cold proxy start.

These five choices, plus a few smaller ones, are how the budget gets hit. Every one of them is a deliberate decision, made early in the product's design, with full awareness of the alternative.

How we measure

A short note on methodology. We measure latency at the proxy level, end to end, including the time it takes the proxy to receive the request from the client, terminate the TLS connection, run the analysis pipeline, run the policy decision, and emit the action.

We measure on user-class hardware, in production builds, not on optimized servers. Except where a figure is marked typical, the numbers in this post are p99 across our internal soak test, which runs the proxy at sustained load for hours against synthetic AI traffic that matches the request profile we observe in real use.

We do not publish averages. Averages are misleading for latency budgets, because the user notices the worst cases, not the typical case. p99 is closer to what the user experiences. p99.9 would be more honest still; we are working toward publishing it once we have enough data.
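The gap between an average and a p99 is easy to make concrete. The sketch below uses the nearest-rank percentile definition, which is one common convention and an assumption here; the post does not say which definition the soak test uses. The function names are illustrative.

```rust
/// Nearest-rank percentile of a sample set, for p in (0, 100].
/// Returns None for an empty set or an out-of-range p.
pub fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: the smallest rank covering at least p percent of samples.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Arithmetic mean, or None for an empty set.
pub fn mean(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().sum::<f64>() / samples.len() as f64)
}
```

Take a made-up sample of 100 requests where 98 take 2ms and 2 take 400ms. The mean is 9.96ms, which looks fine against a 10ms budget. The nearest-rank p99 is 400ms, which is what the unlucky user actually experiences.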

If a competitor publishes only average latency, ask them for p99. If they cannot produce it, the product has not been engineered against the budget. They are guessing.

What this means for buyers

A few specific things.

Ask for p99 latency, end to end, on production builds, on user-class hardware. Not synthetic benchmarks on optimized servers. The number that matters is what the user sees on their MacBook.

Ask whether any inference happens off-device. If the answer is yes, the product is paying network round-trip time on every request. The latency budget cannot hold under that constraint.

Ask for the latency budget as a percentage of typical AI request time. A defense layer that adds 50ms to a 3,000ms AI request is adding 1.7%. A defense layer that adds 300ms to the same request is adding 10%. The user notices the second number. They do not notice the first.
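The percentages in that comparison come from a one-line calculation, shown here as a small helper with a hypothetical name so the same check can be run against any vendor's figures.

```rust
/// Added latency as a percentage of the underlying AI request time.
/// Returns None when the request time is not positive.
pub fn overhead_percent(added_ms: f64, request_ms: f64) -> Option<f64> {
    if request_ms > 0.0 {
        Some(added_ms / request_ms * 100.0)
    } else {
        None
    }
}
```

For a 3,000ms request, 50ms of added latency works out to about 1.7% and 300ms to 10%.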

Ask for the worst-case behavior under sustained load. Some defense layers behave acceptably at low traffic and fall apart at heavy use. The user with a high-volume workflow is exactly the user who depends on the defense layer most. A product that cannot hold the budget under load is not a product they can use.

These four questions filter most of the products you will be evaluating down to a small set. The small set is the set worth taking seriously.

The takeaway

The latency budget is not one of many constraints. It is the constraint. Every other engineering decision derives from it.

We hit it. We hit it because we designed the architecture around it from the first commit. The detection models, the planes, the bus, the policy engine, the chain writes are all chosen to fit within budgets that, taken together, produce a proxy that is essentially invisible to the user in the common case.

This is the correct engineering target for a consumer AI defense product. Anything looser produces a product that gets disabled. Anything tighter is unattainable on consumer hardware.

The product is the silence. The silence is paid for in engineer-hours. We pay them.
