Classifier-on-Output: Catching Misbehavior Post-Generation

Input filtering and refusal training catch the obvious attacks. The interesting failures happen when the input passed every check, the model produced a fluent answer, and that answer was nevertheless wrong, harmful, or out-of-policy in a way no upstream gate could have predicted. Classifier-on-output is the pattern that addresses this. It is also the pattern most often skipped — usually for latency reasons, occasionally because teams don’t realize they need it until an output ships that gets them paged at 2am.

This post covers the architectures that work, the ones that don’t, and where the latency budget actually lives.

Why output classification is not redundant

The argument against classifier-on-output goes: “We already moderate the input, the model is RLHF’d to refuse, and we sanitize the response. Why add another model on the hot path?” The argument is wrong for three concrete reasons.

First, refusal training has measurable false-negative rates against adversarial inputs. Once an attacker has slipped past the input filter — typically with a jailbreak that the input classifier did not recognize — the only thing standing between the model and the user is the model’s own safety tuning. That tuning was trained against a fixed distribution of attacks. It generalizes imperfectly. Classifier-on-output catches the cases where the model complied with a successful jailbreak even though it “should” have refused.

Second, hallucinated harmful content is not blocked by refusal training at all. The model is not refusing; the model is wrong in a dangerous way. A medical chatbot confidently citing a contraindication that does not exist. A legal-summary model inventing case law. A code assistant generating a snippet that, when run, deletes user data. Refusal logic does not fire because nothing about the input looked harmful. Only an output-side check has any chance of catching this.

Third, policy drift between model versions is invisible without output monitoring. The model you shipped last month and the model you shipped this week may disagree on whether a borderline input warrants a refusal. An output classifier gives you a measurable signal — refusal rate, harm score distribution, category mix — that surfaces drift before users do.

The three architectures that actually work

Inline blocking. The classifier runs on the full generated response before the response is returned. If the classifier flags, the response is replaced with a refusal template or regenerated with stricter constraints. Latency cost is significant — you serialize generation and classification. Use this when the cost of shipping a bad output is higher than the cost of an extra 100–300 ms of latency. Most consumer chat surfaces qualify.

Streaming with cutoff. The classifier runs on partial outputs as tokens stream. When it flags, generation halts and a refusal replaces the partial response. This is the architecture that ships in most production assistants today. It hides the classifier latency inside the perceived response time. The implementation is harder than it looks: the classifier must be cheap enough to run on rolling buffers, and your UI has to be able to swap the streaming text for a refusal cleanly without user-visible artifacts. Llama Guard ↗ 3 1B and similar small models are sized for exactly this pattern — see Llama Guard vs NeMo Guardrails vs OpenAI Moderation API: Production Tradeoffs for a latency and coverage comparison of the leading runtime classifier options.

Post-hoc review. The classifier runs after the response ships, in async logging. Flagged outputs go to a review queue or trigger alerting. This is the only architecture that costs zero latency. It catches nothing for the user who got the bad response — but it gives you the data to improve the upstream guardrails and the model itself. For internal tools or anywhere the harm tolerance is higher, post-hoc is enough. For consumer-facing or regulated surfaces, post-hoc is a complement to inline or streaming, never a replacement.

What to classify against

Generic toxicity is the lowest-value classifier you can run. It is mostly solved upstream and rarely fires on the failures that matter. Higher-value classes:

Domain-specific harmful content — medical advice when the system is not licensed to give it, legal advice in jurisdictions where it triggers liability, financial guidance that violates regulatory carve-outs.
PII echoing — the model regurgitating user-supplied PII, training-data PII, or system-context PII back into the response.
Tool-call misuse — for agentic systems, classifying whether a proposed tool call is consistent with policy before executing.
Hallucinated citations — claims that look like references but resolve to nothing real. A small retrieval-augmented classifier can flag these.
Prompt-injection success signatures — phrases that indicate the model accepted an injected instruction (acknowledging an alternate persona, ignoring prior instructions, switching languages mid-response).

The right mix depends on your domain. The wrong move is to ship a generic-toxicity classifier and assume the problem is solved. Several of these classes ship as ready-made output scanners — Sensitive, NoRefusal, FactualConsistency, MaliciousURLs — in LLM Guard’s output-scanning pipeline, and the prompt-injection success signatures are the output-side complement to the input detectors compared in Prompt Injection Detectors: Rebuff vs Vigil vs LLM Guard.

What this costs

Budget 30–100 ms of latency for a small inline classifier on a modest GPU, or 5–15 ms for a distilled BERT-class classifier on CPU. Streaming reduces the user-visible cost to roughly zero if your generation is fast enough that the classifier finishes before the next chunk would have shipped. Post-hoc costs only compute, and the queue can absorb load spikes.

The bigger cost is operational. Output classifiers need eval sets, regression monitoring, and a tuning loop for false positives. The next post in this series gets specific about how to measure that: False Positive Cost in Production Refusal Systems: How to Measure and Tune. For where output classification sits among the full set of scanners and runtime guards, see Best LLM Security Scanners: Open-Source and Enterprise Compared.

For more context, AI defense strategies ↗ covers related topics in depth.