Threat Research

False positives killed every consumer security product before this one.

A claim, stated up front. The single most important number in consumer AI security is not the threat detection rate. It is not the latency budget. It is not the architectural complexity score. It is the daily false positive count visible to the user.

Above three alerts per day, the user mutes the alerts. Above ten, they uninstall the product. These thresholds are empirical. They have been the thresholds for at least a decade across every adjacent consumer security category: antivirus, browser security extensions, password manager warnings, mobile app permission prompts. The thresholds are not opinions. They are observed behavior across hundreds of millions of users.

A defense product that produces alerts above these thresholds is not a defense product. It is dead software waiting to be removed. Most of the consumer AI security entrants you are seeing today will fail on this metric within twelve months of shipping. The architecture they have chosen produces an alert rate the consumer cannot tolerate, and there is no patch that fixes the architecture without rebuilding it.

This post is about that number, why it determines the category, and what we have done to design around it.

The product is not the detection. The product is the silence around the detection. Every alert that does not need to fire is the product working. Every alert that does fire has to be one the user can act on.

The graveyard

A non-exhaustive list of consumer security products that died on this metric.

Browser-based malware extensions in the late 2000s. Detection rates were respectable. False positive rates were not. Users disabled the extensions because legitimate sites were flagged as risky. Within two years, the category collapsed back into the browser vendors' built-in filters, which had a stricter false positive budget by necessity.

Mobile permission systems that asked for confirmation on every sensitive operation. iOS shipped a version of this. Android shipped a version of this. Both walked it back, because once the prompt count crosses a threshold users tap through to yes by default, at which point the prompts add real UX friction and no security value at all.

Endpoint detection products that shipped with their default sensitivity set to high. Enterprise IT teams disabled the products because the alert volume crowded out actual incidents. The vendors retreated to lower default sensitivity, which produced lower detection rates but higher product retention. The high-sensitivity default had worked out to a worse security outcome than a less sensitive one would have, because a disabled product detects nothing; the lower default let the product survive to ship the next version.

Email security plugins that flagged senders on the basis of statistical patterns. They flagged everything. Users muted the plugin. Then forgot it existed. Then uninstalled.

The pattern across all four is the same. A product with statistically reasonable detection that ignored the alert budget produced a worse security outcome than a product with conservative detection that respected the budget. Detection without consumption is not a product. It is a technical demo.

The math, in one paragraph

Three alerts per day is the conscious-tolerance threshold. The user notices alerts up to that volume and engages with them. Above three, the user starts muting selectively. Above five, blanket muting. Above ten, uninstall. These are not hard cutoffs; they are the inflection points of the engagement curve. Different users have different tolerances. The aggregate distribution is what matters.

For a user generating fifty AI requests per day (a moderate-use figure), the false positive rate per request that produces three alerts per day is six percent. At one hundred requests per day, the threshold is three percent. At two hundred (heavy use), one and a half percent.

A statistical model classifying requests with a one percent false positive rate is, by ML standards, performing reasonably. At fifty requests, that produces half an alert per day, which is well under the threshold. At two hundred requests, two alerts, still under. The math checks out.

A statistical model at five percent false positive rate, which is a more typical baseline for a freshly-deployed prompt-injection classifier, produces ten alerts per day at two hundred requests. Dead product.

The threshold for category survival, when the user is a heavy AI user, is roughly one to two percent per-request false positive rate. Most of the published prompt-injection classifiers we have evaluated do not hit this. Some are not within an order of magnitude.
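The arithmetic in this section reduces to two one-line functions. A minimal sketch in Python, assuming every false positive surfaces as a user-visible alert and that false positives are independent per request, so the expected count is simply requests times rate. The function names are illustrative, not part of any Vigil API.

```python
ALERT_BUDGET_PER_DAY = 3  # conscious-tolerance threshold from the text


def expected_alerts(requests_per_day: int, fp_rate: float) -> float:
    """Expected false alerts per day if every detection reaches the user."""
    return requests_per_day * fp_rate


def max_fp_rate(requests_per_day: int, budget: float = ALERT_BUDGET_PER_DAY) -> float:
    """Highest per-request false positive rate that stays within the budget."""
    return budget / requests_per_day


if __name__ == "__main__":
    # Usage levels taken from the text: moderate, middle, heavy.
    for n in (50, 100, 200):
        print(f"{n:>4} req/day: max FP rate {max_fp_rate(n):.1%}, "
              f"alerts at 5% FP = {expected_alerts(n, 0.05):.1f}")
```

The budget is fixed and the request count is not, so the tolerable false positive rate shrinks in direct proportion to usage. The heaviest users set the bar.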

What this means for the architecture

A few specific implications.

Detection has to feed enforcement, not the user. When a statistical model fires, it cannot fire as an alert to the user. It has to fire as a signal to the policy engine. The policy engine evaluates the action against the user's authorized capabilities. If the action is within authorization, it proceeds. If it is outside authorization, it holds for confirmation. In neither case does the detection score reach the user directly: it can move the policy bar, but the policy decision is the only thing the user sees.
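One way to picture the separation, as a sketch rather than Vigil's actual policy engine: the detection layer contributes a single integer that raises the bar an action must clear, and the only user-visible outcome is a hold. The tier numbers, the `Action` type, and the `decide` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"             # silent: recorded, never surfaced
    HOLD_FOR_CONFIRMATION = "hold"  # the only path that reaches the user


@dataclass(frozen=True)
class Action:
    name: str
    required_tier: int  # hypothetical: authorization level the action needs


def decide(action: Action, authorized_tier: int, tier_shift: int = 0) -> Decision:
    """Compare an action against the user's authorization.

    tier_shift is the detection layer's only input. It raises the bar the
    action must clear; it never produces an alert by itself.
    """
    effective_required = action.required_tier + tier_shift
    if effective_required <= authorized_tier:
        return Decision.PROCEED
    return Decision.HOLD_FOR_CONFIRMATION


if __name__ == "__main__":
    read = Action("read_file_in_scope", required_tier=0)
    pay = Action("send_payment", required_tier=2)
    print(decide(read, authorized_tier=1, tier_shift=1))  # still within authorization
    print(decide(pay, authorized_tier=1))                 # outside authorization
```

In this sketch a noisy detector can only make the policy stricter for actions already near the edge of authorization. A low-risk action with a high detection score still passes silently.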

This is the two-surface decomposition viewed through the false-positive math lens. The decomposition is not just architectural; it is what makes the alert rate viable. A defense that surfaces every detection score above a threshold to the user is a defense that hits the alert budget by lunchtime on a heavy-use day.

Per-user baselining is required. Population models produce too many false positives when applied to individual users with unusual but legitimate patterns. The user who runs ten thousand requests per day for a research workflow has a different normal than the user who runs ten requests per day for casual chat. The same population-level threshold flags the heavy user as anomalous when they are simply using the product more. The Bayesian model in the four-model ensemble handles this. Without it, the product is impossible to deploy across a population with diverse usage patterns.
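The post does not specify the Bayesian model, so the following is a stand-in rather than the ensemble's implementation: a Gamma-Poisson conjugate model of daily request volume, where a weak population prior is quickly overridden by the user's own history. The prior values, the class name, and the `surprise` score are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class UserBaseline:
    # Gamma(shape, rate) prior over the user's daily request rate.
    # Hypothetical population prior: mean 50 requests/day (5.0 / 0.1),
    # weighted as only a tenth of a day of evidence.
    shape: float = 5.0
    rate: float = 0.1

    def observe(self, daily_count: int) -> None:
        """Conjugate update with one more day of Poisson-distributed usage."""
        self.shape += daily_count
        self.rate += 1.0

    def expected(self) -> float:
        """Posterior mean of the daily request rate."""
        return self.shape / self.rate

    def surprise(self, daily_count: int) -> float:
        """Standard deviations above this user's own predictive mean."""
        mean = self.expected()
        # Variance of the negative binomial posterior predictive.
        variance = mean * (1.0 + 1.0 / self.rate)
        return (daily_count - mean) / math.sqrt(variance)


if __name__ == "__main__":
    heavy, fresh = UserBaseline(), UserBaseline()
    for _ in range(30):
        heavy.observe(10_000)
    print(f"heavy user, 10k day: {heavy.surprise(10_000):.2f} sd")
    print(f"new user,   10k day: {fresh.surprise(10_000):.2f} sd")
```

The same ten-thousand-request day is unremarkable for the research user and extreme for an account with no history. Real usage is burstier than a Poisson model assumes, so a production baseline would need a more forgiving variance.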

The detection ensemble has to handle disagreement gracefully. When one model scores high and three score low, the ensemble's output should reflect the disagreement, not promote the single high score to a tier escalation. We have written the policy engine's combination rule to weight agreement among models. A single high score with three low scores produces a soft tier shift. Three high scores with one low produces a harder shift. The math is in the source.
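The actual combination rule lives in the source and is not reproduced here. As a hedged illustration of agreement weighting, the sketch below counts how many models score high and hardens the shift only when at least three quarters of them agree. The `HIGH_SCORE` cutoff, the three-quarters rule, and treating a two-of-four split as soft are all assumptions.

```python
from typing import Sequence

HIGH_SCORE = 0.8  # hypothetical cutoff for "this model scored high"

NO_SHIFT, SOFT_SHIFT, HARD_SHIFT = 0, 1, 2


def tier_shift(scores: Sequence[float], high: float = HIGH_SCORE) -> int:
    """Turn per-model scores into a policy tier shift based on agreement."""
    votes = sum(1 for score in scores if score >= high)
    if votes == 0:
        return NO_SHIFT
    # Harden only when at least three quarters of the models agree
    # (3 of 4 in a four-model ensemble). Integer arithmetic avoids rounding.
    if votes * 4 >= 3 * len(scores):
        return HARD_SHIFT
    return SOFT_SHIFT


if __name__ == "__main__":
    print(tier_shift([0.95, 0.10, 0.20, 0.10]))  # one dissenting high score
    print(tier_shift([0.90, 0.90, 0.85, 0.10]))  # three models agree
```

Counting votes instead of taking the maximum score is what keeps a single overconfident model from escalating by itself.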

The product has to default toward silence, not safety. This is an uncomfortable framing for a security product, but it follows from the math. A product that defaults to high sensitivity will produce more alerts than the user can absorb, the user will mute, and the product will produce zero security value. A product that defaults to lower sensitivity, at the same level of detection capability, produces fewer alerts that the user actually evaluates, and the product produces non-zero security value. The defaults are part of the security model.

This is not the framing the security industry traditionally uses. The traditional framing is that any uncertainty should produce an alert and the user should evaluate. The traditional framing produces the graveyard. We are deliberately departing from it.

The non-aesthetic version of "audit, not alerts"

A defense product that minimizes alerts to the user is not a product that does nothing. It is a product whose work is mostly invisible to the user, recorded in the audit chain, and visible to the user on demand rather than by interruption.

Most of what Vigil does, in a normal session, is record. Every action passes through the Execution Gate. Every action produces an audit chain entry. Most actions pass through without raising the user's attention. The work is in the chain.
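The post does not describe the audit chain's format. A common way to build such a record, assumed here purely for illustration, is a hash-linked log in which each entry commits to the one before it, so silent work stays verifiable on demand. The field names and both functions are hypothetical.

```python
import hashlib
import json
from typing import Any

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict[str, Any]) -> str:
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict[str, Any]], action: str, decision: str) -> dict[str, Any]:
    """Record one action and its decision, linked to the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    body = {"index": len(chain), "action": action,
            "decision": decision, "prev_hash": prev_hash}
    entry = {**body, "hash": _digest(body)}
    chain.append(entry)
    return entry


def verify(chain: list[dict[str, Any]]) -> bool:
    """Recompute every hash and link; any edit to a past entry fails."""
    prev_hash = GENESIS_HASH
    for index, entry in enumerate(chain):
        body = {key: value for key, value in entry.items() if key != "hash"}
        if (entry["index"] != index or entry["prev_hash"] != prev_hash
                or entry["hash"] != _digest(body)):
            return False
        prev_hash = entry["hash"]
    return True
```

Every action appends an entry whether or not the user is interrupted, which is what lets the record be reviewed later instead of at the moment of the action.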

When something does require attention, the alert is for an action the user has explicit policy intent on. Send a payment over the user's threshold. Modify a file outside the agent's scope. Take an action that the detection ensemble has flagged with high cross-model agreement. The user sees the structured action, decides, and moves on. The friction is calibrated.

The contrast: a product that surfaces every detection score above a threshold produces a stream of alerts the user cannot evaluate, dilutes the alerts that matter into the noise of the alerts that do not, and ends up with the worst of both outcomes. The user is alerted constantly and protected rarely. The architecture of "fire alerts on every signal" guarantees this outcome regardless of the underlying detection quality.

The bar for the category

A small number of asks for anyone building in this category, or evaluating products that claim to operate in it.

Publish your alert-rate data. Not your detection-rate data. Your alert rate. How many alerts do users see per day on average? At median? At the 90th percentile? Without this number, the buyer cannot evaluate whether the product is deployable.
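Computing the three figures asked for here takes very little code. A minimal sketch, assuming the input is one alert count per user per day; the nearest-rank percentile is one of several common definitions, chosen because it always returns an observed value. The sample data is invented for illustration.

```python
import math
import statistics
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def alert_rate_summary(alerts_per_user_day: Sequence[float]) -> dict[str, float]:
    """Mean, median, and 90th percentile of daily user-visible alerts."""
    return {
        "mean": statistics.mean(alerts_per_user_day),
        "median": statistics.median(alerts_per_user_day),
        "p90": percentile(alerts_per_user_day, 90),
    }


if __name__ == "__main__":
    # Invented sample: ten users, one day each.
    sample = [0, 0, 0, 1, 1, 1, 2, 2, 4, 12]
    print(alert_rate_summary(sample))
```

On the invented sample the median is one alert and the mean is 2.3, both inside the budget, while the 90th percentile is four. A product that publishes only the average would hide the users most likely to mute it.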

Distinguish detection from enforcement in your architecture. A product that conflates the two will, on a long enough timeline, produce the alert rate that kills it. The two-surface separation is a structural property, not a feature you can ship later.

Build per-user baselines. Population models are the wrong baseline for products that ship to diverse users. Per-user baselines are harder to engineer but they are the only way to keep the false positive rate stable across the user distribution.

Default to silence. Make the defaults err toward fewer alerts, not more. Provide a sensitivity slider for users who want to dial up. Most users will not. The product that wins this category will be the one that produces the fewest alerts, every one of which the user can act on.

These four properties are not unique to Vigil. They are the bar for any product that intends to be installed for longer than three weeks. The history of consumer security is the history of products that did not meet the bar.

The takeaway

The number is three alerts per day. Build the product around it.

Do this from the first commit. Do not retrofit. The architectural decisions that produce a low alert rate are upstream of every feature decision that comes later, and changing them in the middle of a product cycle requires rebuilding the detection ensemble, the policy engine, and the user-facing layer simultaneously. We did this once, in v0.6, and it cost us six weeks. We will not do it again, because we will not need to.

If your product produces more than three alerts per day at the median, the rest of your architecture does not matter. The user will not be there to see whether your detection is good.
