LLM Guard: Input and Output Scanning for Production LLM Apps

LLM Guard ↗ is an open-source, self-hosted security toolkit from Protect AI that sits in the request path of an LLM application and screens both directions: it scans user prompts before they reach the model, and it scans model responses before they reach the user. Unlike a pre-deployment scanner such as garak — which finds vulnerabilities before you ship — LLM Guard is a runtime guardrail. It runs on live traffic. That distinction governs everything about how you evaluate and deploy it.

LLM Guard is MIT-licensed, requires Python 3.9+, and runs entirely on your own infrastructure, which means prompts and responses never leave your environment — the dividing line that decides whether a hosted moderation API is even an option for regulated workloads.

The two-sided pipeline

LLM Guard splits its work into two scanner families that map onto the two halves of an LLM request.

Input scanners run on the user’s prompt before it reaches the model. The library ships a substantial set, including:

PromptInjection — detects prompt-injection attempts (the headline threat, OWASP LLM01)
Anonymize — redacts PII before it reaches the model
BanTopics, BanSubstrings, BanCompetitors, BanCode — policy filters for disallowed topics, strings, competitor mentions, or code
Toxicity, Sentiment — content classifiers on the inbound side
Secrets — catches API keys and credentials in user input
Gibberish, Language, InvisibleText, TokenLimit, Regex, Code — structural and formatting checks

Output scanners run on the model’s response before it ships to the user. This set is larger, including:

NoRefusal — detects whether the model refused (useful for measuring over-refusal)
Sensitive — flags sensitive data leakage in the output
Deanonymize — restores entities that Anonymize redacted on the way in
Toxicity, Bias, Gibberish, MaliciousURLs, FactualConsistency, Relevance — quality and safety classifiers
Code, JSON, Language, LanguageSame, Regex, ReadingTime, URLReachability, Sentiment — structural and policy checks

The asymmetry is deliberate. Input scanning faces adversarial obfuscation (the attacker controls the text); output scanning faces the model’s own behavior (a benign prompt can still yield a harmful, leaking, or off-topic response). The two need different scanners, and conflating them is a design mistake — a point we develop in the classifier-on-output pattern.

How a scan works

Each scanner takes the text and returns three things: a sanitized version of the text (some scanners modify it — Anonymize redacts, others pass through), a boolean for whether it’s valid, and a risk score. You compose scanners into a pipeline and decide your policy: block on any failure, block above a score threshold, or sanitize-and-continue.

A minimal input-scan pipeline looks like:

from llm_guard import scan_prompt
from llm_guard.input_scanners import Anonymize, PromptInjection, Toxicity

scanners = [Anonymize(), Toxicity(), PromptInjection()]
sanitized_prompt, results_valid, results_score = scan_prompt(scanners, prompt)

The output side mirrors it with scan_output, which additionally takes the original prompt so scanners like Relevance and Deanonymize have the context they need. Protect AI also ships LLM Guard as an API/container if you’d rather run it as a sidecar than import it as a library.

Where the latency goes

The thing teams underestimate is that several LLM Guard scanners are themselves ML models — PromptInjection, Toxicity, Bias, and others run transformer classifiers. That means LLM Guard adds real latency on every request, on top of the model call it protects, and the cost scales with how many model-backed scanners you enable. A pipeline with PromptInjection + Toxicity on input and Sensitive + Toxicity + FactualConsistency on output is several extra inferences per request.

Two practical consequences:

Enable only the scanners your threat model needs. Every model-backed scanner is latency you pay on the >99% of benign traffic too. The cheap pattern scanners (Regex, BanSubstrings, Secrets) are nearly free; the classifiers are not.
Benchmark the configured pipeline, not the published per-scanner numbers. The latency that matters is your composed pipeline at your concurrency, on your hardware. For methodology on measuring this honestly, aisecbench.com ↗ covers latency and cost as first-class benchmark metrics.

Failure modes worth pre-empting

LLM Guard’s classifiers inherit the failure mode of every safety classifier: over-refusal on benign-but-adjacent inputs. A PromptInjection scanner tuned for high recall will flag legitimate prompts that happen to contain instruction-like phrasing. You catch this only by running your own eval set through the configured pipeline before shipping and re-running it whenever you change a scanner or its threshold. For a structured method to measure the false-positive cost and tune thresholds without breaking your detection rate, see False Positive Cost in Production Refusal Systems: How to Measure and Tune.

The PromptInjection scanner is also not a complete defense on its own — no single detector is. Protect AI’s own framing positions LLM Guard as one layer in defense-in-depth, not a silver bullet. Pair it with model-side safety tuning and, where the threat model warrants, a dedicated guardrail model.

Where LLM Guard fits in the stack

LLM Guard occupies the runtime input/output screening slot. It complements, rather than replaces, the rest of a defensive stack:

Pre-deployment scanning (garak, PyRIT) finds the vulnerabilities before you ship; LLM Guard catches live attempts after. See our garak walkthrough and PyRIT explainer.
Guardrail models (Llama Guard) and frameworks (NeMo Guardrails) overlap with LLM Guard on content classification but differ in shape — a model vs a scanner library vs a framework. Our Llama Guard vs NeMo vs OpenAI Moderation comparison maps the tradeoffs, and guardml.io ↗ covers the broader guardrail landscape.

For where LLM Guard sits among the full set of open-source and enterprise options, see Best LLM Security Scanners: Open-Source and Enterprise Compared.

Practical Recommendation

Reach for LLM Guard when you want self-hosted, in-VPC input/output screening and you’d rather assemble a pipeline from a broad scanner library than adopt a single guardrail model or a programmable framework. Start with a minimal set — PromptInjection and Anonymize on input, Sensitive and NoRefusal on output — measure the latency and false-positive cost on your own traffic, and add scanners only as your threat model justifies the inference budget. Keep it as the runtime layer behind a pre-deployment scanner, not as your only line of defense. For more context on layering runtime guards with the rest of a defensive stack, aidefense.dev ↗ covers related strategies in depth.

Sources

protectai/llm-guard — GitHub ↗: Official repository. MIT-licensed, Python 3.9+. Full input- and output-scanner catalog and the scan_prompt / scan_output API.
LLM Guard — Protect AI ↗: Product page describing LLM Guard’s role as a self-hosted, in-infrastructure guardrail and its defense-in-depth positioning.
LLM Guard documentation (Protect AI) ↗: Per-scanner documentation for both the input and output sides, including configuration and thresholds.

LLM Guard: Input and Output Scanning for Production LLM Apps

The two-sided pipeline

How a scan works

Where the latency goes

Failure modes worth pre-empting

Where LLM Guard fits in the stack

Practical Recommendation

Sources

Sources

Best LLM Scanners — in your inbox

Related

Choosing an LLM Guardrail: Llama Guard, NeMo Guardrails, Guardrails AI

Classifier-on-Output: Catching Misbehavior Post-Generation

Llama Guard vs NeMo vs OpenAI Moderation: Production Tradeoffs

Comments