False Positive Cost in Refusal Systems: Measure and Tune

Every refusal system has a dial. Turn it one way: more attacks blocked, more legitimate requests refused. Turn it the other way: more legitimate requests served, more attacks slipping through. There is no setting that gives you both. Teams who pretend otherwise are usually optimizing one side blindly and discovering the other side at retrospective.

This post is about the side most teams under-measure: the cost of refusing requests that should have been served. False positives are quieter than false negatives — nobody pages you because the model declined to answer — but they erode product quality, drive users to competitors, and over time hollow out the case for shipping LLM features at all.

Why false positives are hard to see

False negatives have a discovery channel: someone notices the model said something it shouldn’t, the screenshot goes around, and you have a ticket. False positives have no such channel. A user asks a reasonable question, gets refused, shrugs, and leaves. The next 99 users do the same. You see the churn in your retention numbers six weeks later, and even then it is hard to attribute to refusals specifically.

If you do not deliberately instrument refusals, you will systematically underweight them in your tuning decisions. The first move is making them visible. One effective approach is deploying an async post-generation classifier on model outputs to surface refusal failures without adding latency — see Classifier-on-Output Patterns: Catching Model Misbehavior Post-Generation for implementation details.

Build a refusal eval set before tuning anything

You need three buckets of inputs, sized at least a few hundred examples each:

Bucket A: Legitimate-and-clearly-on-policy. Mundane requests that any reasonable assistant should serve. “Help me draft a polite decline email.” “Explain the difference between two Python packages.” For a domain assistant, domain-appropriate equivalents. The expected behavior is “answer.” A refusal here is a clean false positive.

Bucket B: Adversarial-and-clearly-off-policy. Jailbreaks, prompt injections, requests for genuinely harmful content. The expected behavior is “refuse.” A non-refusal here is a false negative. A pre-deployment scanner like garak is a practical source of seed inputs for this bucket — its jailbreak and injection probes generate exactly the adversarial corpus you need.

Bucket C: Ambiguous edge cases. Requests where reasonable people disagree about whether to refuse. Medical questions in a general assistant. Security research questions. Dual-use technical queries. The expected behavior is “refuse with explanation, offer to engage with the safe-mode version” — or whatever your product’s specific policy is. Track this bucket separately. It is where most of the visible quality regressions live.

Run the eval against your current configuration to establish a baseline. Most production assistants land somewhere between 5–15% false positive rate on bucket A, 1–5% false negative rate on bucket B, and 30–60% “wrong-shape refusal” rate on bucket C. Whether those numbers are good or bad depends on your product. The point is you cannot tune what you have not measured.

Convert false positive rates into business cost

A 10% false positive rate on bucket A means, in expectation, one in ten legitimate user requests gets refused. That number is meaningless until you tie it to revenue, retention, or task success.

The simplest model: pick a conversion-relevant downstream metric (task completion, session length, return rate) and run a holdout where a fraction of users hits a configuration with a different refusal threshold. Hold all else constant. Measure the metric difference. The slope of metric-per-percentage-point-of-FPR is the local price of refusals in your product.

This is the same shape of experiment you would run for any classifier threshold. The only twist for refusals: bucket B contamination. If you also reduce the false negative rate when you reduce the false positive rate (you usually do not, but check), you have to control for the operational cost of the harms you let through. Most teams undervalue this control and ship configurations that look better on user metrics and worse on harm rate.

Tune the threshold, not the model

Refusal behavior comes from two layers: the underlying model’s safety tuning, and any classifier-on-input or classifier-on-output you have stacked on top. The model’s tuning is a discrete choice — you have whatever the vendor shipped, or whatever your fine-tune produced. The classifier thresholds are continuous and tunable per-deployment.

Tune the classifiers. The most actionable lever is the threshold at which you trigger a refusal. Llama Guard ↗ returns a probability; you choose the cutoff. NeMo Guardrails composes rails; you choose which rails are blocking versus advisory. OpenAI Moderation returns per-category scores; you choose which categories block at what thresholds. For a production-level comparison of these three tools — latency, customization surface, and where each one breaks — see Llama Guard vs NeMo Guardrails vs OpenAI Moderation API: Production Tradeoffs.

Move thresholds in small increments. A swing from 0.5 to 0.3 in a moderation classifier can flip 5–10% of borderline requests from served to refused. That is not a small change. Run the eval suite on every adjustment, watch buckets A and B move together, and stop when the marginal harm rate reduction is no longer worth the marginal refusal rate increase.

Build the regression suite

After tuning, the threshold is only the current best answer. Model updates, prompt edits, and policy revisions will all shift the distribution. The regression suite is what catches the drift.

Run the eval on every model version pin change, every guardrail config change, every prompt edit. Alert on bucket A false positive rate increasing by more than a couple of points or bucket B false negative rate increasing by more than a fraction of a point. Treat the alerts the same way you treat latency regressions: a number going the wrong way is a release blocker until someone explains it.

The goal is not zero false positives. The goal is knowing what your rate is, knowing what it costs you, and noticing when it changes. The guardrail you tune is only one layer; for where it sits among pre-deployment scanners and runtime guards, see Best LLM Security Scanners: Open-Source and Enterprise Compared.

For more context, AI defense strategies ↗ covers related topics in depth.

False Positive Cost in Refusal Systems: Measure and Tune

Why false positives are hard to see

Build a refusal eval set before tuning anything

Convert false positive rates into business cost

Tune the threshold, not the model

Build the regression suite

Best LLM Scanners — in your inbox

Related

Choosing an LLM Guardrail: Llama Guard, NeMo Guardrails, Guardrails AI

LLM Guard: Input and Output Scanning for Production LLM Apps

Best LLM Security Scanners: Open-Source and Enterprise Compared

Comments