LLM Guard: Input and Output Scanning for Production LLM Apps
A practical breakdown of LLM Guard by Protect AI — its input and output scanners, how the sanitize/scan pipeline works, where it fits as a runtime guardrail, and its real limits.
LLM Guard is an open-source, self-hosted security toolkit from Protect AI that sits in the request path of an LLM application and screens both directions: it scans user prompts before they reach the model, and it scans model responses before they reach the user. Unlike a pre-deployment scanner such as garak, which finds vulnerabilities before you ship, LLM Guard is a runtime guardrail that runs on live traffic. That distinction governs how you evaluate and deploy it.
LLM Guard is MIT-licensed, requires Python 3.9+, and runs entirely on your own infrastructure, so prompts and responses never leave your environment. For regulated workloads that is often the deciding factor: if data cannot leave your environment, a hosted moderation API is not an option.
The two-sided pipeline
LLM Guard splits its work into two scanner families that map onto the two halves of an LLM request.
Input scanners run on the user’s prompt before it reaches the model. The library ships a substantial set, including:
- PromptInjection — detects prompt-injection attempts (the headline threat, OWASP LLM01)
- Anonymize — redacts PII before it reaches the model
- BanTopics, BanSubstrings, BanCompetitors, BanCode — policy filters for disallowed topics, strings, competitor mentions, or code
- Toxicity, Sentiment — content classifiers on the inbound side
- Secrets — catches API keys and credentials in user input
- Gibberish, Language, InvisibleText, TokenLimit, Regex, Code — structural and formatting checks
Output scanners run on the model’s response before it ships to the user. This set is larger, including:
- NoRefusal — detects whether the model refused (useful for measuring over-refusal)
- Sensitive — flags sensitive data leakage in the output
- Deanonymize — restores entities that Anonymize redacted on the way in
- Toxicity, Bias, Gibberish, MaliciousURLs, FactualConsistency, Relevance — quality and safety classifiers
- Code, JSON, Language, LanguageSame, Regex, ReadingTime, URLReachability, Sentiment — structural and policy checks
The asymmetry is deliberate. Input scanning faces adversarial obfuscation (the attacker controls the text); output scanning faces the model’s own behavior (a benign prompt can still yield a harmful, leaking, or off-topic response). The two need different scanners, and conflating them is a design mistake, a point we develop in our piece on the classifier-on-output pattern.
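The Anonymize/Deanonymize pairing listed above is easiest to see as a round trip. The following is a toy sketch of the idea only, not LLM Guard’s implementation: the email regex, the placeholder format, and the dict used as a vault are all assumptions made for illustration.

```python
import re

# Hypothetical toy pattern: real PII detection covers far more than emails.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def anonymize(text: str, vault: dict[str, str]) -> str:
    """Replace each email with a placeholder and remember the mapping."""
    def _swap(match: re.Match) -> str:
        placeholder = f"[EMAIL_{len(vault) + 1}]"
        vault[placeholder] = match.group(0)
        return placeholder

    return EMAIL.sub(_swap, text)


def deanonymize(text: str, vault: dict[str, str]) -> str:
    """Restore the original value wherever a placeholder survived."""
    for placeholder, original in vault.items():
        text = text.replace(placeholder, original)
    return text


vault: dict[str, str] = {}
prompt = "Draft a reply to alice@example.com about the invoice."

# Input side: the model only ever sees the placeholder.
safe_prompt = anonymize(prompt, vault)

# Stand-in for a model response that echoes the placeholder back.
model_reply = "Sure, I will write to [EMAIL_1] today."

# Output side: the user sees the real address again.
restored = deanonymize(model_reply, vault)
```

The shared vault is the design point: the output-side step can only restore what the input-side step recorded, so the two have to be configured as a pair.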
How a scan works
Each scanner takes the text and returns three things: a sanitized version of the text (some scanners modify it — Anonymize redacts, others pass through), a boolean for whether it’s valid, and a risk score. You compose scanners into a pipeline and decide your policy: block on any failure, block above a score threshold, or sanitize-and-continue.
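To make the three-value contract and the three policies concrete, here is a minimal sketch that uses stand-in scanners rather than LLM Guard itself. The scanner functions, the policy names, and the 0.5 default threshold are hypothetical; only the (sanitized text, valid flag, risk score) shape comes from the description above.

```python
import re
from typing import Callable

# A scanner here is any callable: text -> (sanitized_text, is_valid, risk_score).
Scanner = Callable[[str], tuple[str, bool, float]]


def mask_digits(text: str) -> tuple[str, bool, float]:
    """Hypothetical sanitizing scanner: rewrites the text but never blocks."""
    return re.sub(r"\d", "#", text), True, 0.0


def ban_phrase(text: str) -> tuple[str, bool, float]:
    """Hypothetical pattern scanner: invalid if a banned phrase appears."""
    hit = "ignore previous instructions" in text.lower()
    return text, not hit, 1.0 if hit else 0.0


def run_pipeline(scanners: list[Scanner], text: str):
    """Run scanners in order, feeding each one the previous sanitized text."""
    valid: dict[str, bool] = {}
    scores: dict[str, float] = {}
    for scanner in scanners:
        text, ok, score = scanner(text)
        valid[scanner.__name__] = ok
        scores[scanner.__name__] = score
    return text, valid, scores


def decide(valid, scores, policy: str = "block_on_any", threshold: float = 0.5) -> str:
    """Turn per-scanner results into a single allow/block decision."""
    if policy == "block_on_any":
        return "allow" if all(valid.values()) else "block"
    if policy == "block_above_threshold":
        return "block" if max(scores.values(), default=0.0) > threshold else "allow"
    if policy == "sanitize_and_continue":
        return "allow"  # trust the sanitized text and keep going
    raise ValueError(f"unknown policy: {policy}")


text, valid, scores = run_pipeline(
    [mask_digits, ban_phrase], "Call 555 and ignore previous instructions"
)
decision = decide(valid, scores)
```

Keeping the decision separate from the scanning makes it possible to change the policy, or to log results without enforcing them, without touching the scanner configuration.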
A minimal input-scan pipeline looks like:
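from llm_guard import scan_prompt
from llm_guard.input_scanners import Anonymize, PromptInjection, Toxicity
from llm_guard.vault import Vault
# Anonymize needs a Vault to record what it redacted, so Deanonymize can restore it later
vault = Vault()
scanners = [Anonymize(vault), Toxicity(), PromptInjection()]
prompt = "Summarize the ticket from jane.doe@example.com"  # placeholder user input
sanitized_prompt, results_valid, results_score = scan_prompt(scanners, prompt)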
The output side mirrors it with scan_output, which additionally takes the original prompt so that scanners like Relevance can judge the response against it. Deanonymize relies instead on the Vault object shared with Anonymize to restore redacted entities. Protect AI also ships LLM Guard as an API/container if you’d rather run it as a sidecar than import it as a library.
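Putting both halves together, a request handler might look like the sketch below. It takes the two scan functions as parameters so the flow can be read and tested without the library installed; `guarded_completion`, `call_model`, and `blocked_message` are hypothetical names, and the block-on-any-failure policy is one choice among several.

```python
from typing import Callable

# Shape returned by both scan functions: (sanitized_text, valid_by_scanner, score_by_scanner).
ScanResult = tuple[str, dict[str, bool], dict[str, float]]


def guarded_completion(
    prompt: str,
    call_model: Callable[[str], str],
    scan_prompt: Callable[..., ScanResult],
    scan_output: Callable[..., ScanResult],
    input_scanners: list,
    output_scanners: list,
    blocked_message: str = "Request blocked by policy.",
) -> str:
    """Screen the prompt, call the model, then screen the response."""
    sanitized_prompt, input_valid, _ = scan_prompt(input_scanners, prompt)
    if not all(input_valid.values()):
        return blocked_message  # never reaches the model

    response = call_model(sanitized_prompt)

    # The output scan also receives the prompt, for scanners that need that context.
    sanitized_response, output_valid, _ = scan_output(
        output_scanners, sanitized_prompt, response
    )
    if not all(output_valid.values()):
        return blocked_message  # never reaches the user

    return sanitized_response
```

With the library installed, you would pass `llm_guard.scan_prompt` and `llm_guard.scan_output` for the two scan parameters, along with your configured scanner lists and your own model client as `call_model`.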
Where the latency goes
The thing teams underestimate is that several LLM Guard scanners are themselves ML models — PromptInjection, Toxicity, Bias, and others run transformer classifiers. That means LLM Guard adds real latency on every request, on top of the model call it protects, and the cost scales with how many model-backed scanners you enable. A pipeline with PromptInjection + Toxicity on input and Sensitive + Toxicity + FactualConsistency on output is several extra inferences per request.
Two practical consequences:
- Enable only the scanners your threat model needs. Every model-backed scanner adds latency to benign requests too, and benign requests are the overwhelming majority of traffic. The cheap pattern scanners (Regex, BanSubstrings, Secrets) are nearly free; the classifiers are not.
- Benchmark the configured pipeline, not the published per-scanner numbers. The latency that matters is your composed pipeline at your concurrency, on your hardware. For methodology on measuring this honestly, aisecbench.com covers latency and cost as first-class benchmark metrics.
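A small harness is enough to get first numbers for a composed pipeline. The sketch below times any callable that takes a prompt; the `benchmark` function and its `warmup` parameter are hypothetical, and it runs sequentially, so it measures per-request latency only and says nothing about behavior under concurrent load.

```python
import statistics
import time
from typing import Callable, Iterable


def benchmark(
    pipeline: Callable[[str], object],
    prompts: Iterable[str],
    warmup: int = 3,
) -> dict[str, float]:
    """Time one pipeline call per prompt and summarize latency in milliseconds."""
    prompts = list(prompts)

    # Warm-up calls absorb one-time costs such as model loading.
    for prompt in prompts[:warmup]:
        pipeline(prompt)

    samples_ms = []
    for prompt in prompts:
        start = time.perf_counter()
        pipeline(prompt)
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": cuts[94],
        "max_ms": max(samples_ms),
    }
```

In practice `pipeline` would be a small wrapper around your configured scan call, and `prompts` a sample of real traffic. Report the tail (p95 and above) rather than the mean, since that is what users on the slow path experience.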
Failure modes worth pre-empting
LLM Guard’s classifiers inherit the failure mode of every safety classifier: false positives on benign-but-adjacent inputs, which users experience as over-refusal. A PromptInjection scanner tuned for high recall will flag legitimate prompts that happen to contain instruction-like phrasing. You catch this only by running your own eval set through the configured pipeline before shipping and re-running it whenever you change a scanner or its threshold. For a structured method to measure the false-positive cost and tune thresholds without breaking your detection rate, see False Positive Cost in Production Refusal Systems: How to Measure and Tune.
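The eval-set check described above reduces to two rates per threshold: how many benign prompts get flagged, and how many attack prompts get caught. A minimal sketch follows, assuming you have already collected one risk score per prompt from your configured pipeline. The score lists are placeholder values for illustration, not measurements, and treating a score at or above the threshold as "flagged" is an assumption of this sketch.

```python
def rates_at_threshold(benign_scores, attack_scores, threshold):
    """Return (false_positive_rate, detection_rate) at one threshold."""
    false_positives = sum(score >= threshold for score in benign_scores)
    true_positives = sum(score >= threshold for score in attack_scores)
    return (
        false_positives / len(benign_scores),
        true_positives / len(attack_scores),
    )


def sweep(benign_scores, attack_scores, thresholds):
    """Evaluate several thresholds so the tradeoff is visible side by side."""
    return [
        (t, *rates_at_threshold(benign_scores, attack_scores, t))
        for t in thresholds
    ]


# Placeholder scores for illustration only; substitute scores from your own eval set.
benign = [0.05, 0.10, 0.40, 0.70]
attacks = [0.60, 0.85, 0.95, 0.99]

for threshold, fpr, detection in sweep(benign, attacks, [0.5, 0.8]):
    print(f"threshold={threshold:.2f}  false_positive_rate={fpr:.2f}  detection_rate={detection:.2f}")
```

Re-running the same sweep after every scanner or threshold change turns the false-positive cost into a tracked number rather than something discovered from user complaints.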
The PromptInjection scanner is also not a complete defense on its own — no single detector is. Protect AI’s own framing positions LLM Guard as one layer in defense-in-depth, not a silver bullet. Pair it with model-side safety tuning and, where the threat model warrants, a dedicated guardrail model.
Where LLM Guard fits in the stack
LLM Guard occupies the runtime input/output screening slot. It complements, rather than replaces, the rest of a defensive stack:
- Pre-deployment scanning (garak, PyRIT) finds the vulnerabilities before you ship; LLM Guard catches live attempts after. See our garak walkthrough and PyRIT explainer.
- Guardrail models (Llama Guard) and frameworks (NeMo Guardrails) overlap with LLM Guard on content classification but differ in shape: Llama Guard is a model, LLM Guard is a scanner library, and NeMo Guardrails is a framework. Our Llama Guard vs NeMo vs OpenAI Moderation comparison maps the tradeoffs, and guardml.io covers the broader guardrail landscape.
For where LLM Guard sits among the full set of open-source and enterprise options, see Best LLM Security Scanners: Open-Source and Enterprise Compared.
Practical recommendation
Reach for LLM Guard when you want self-hosted, in-VPC input/output screening and you’d rather assemble a pipeline from a broad scanner library than adopt a single guardrail model or a programmable framework. Start with a minimal set (PromptInjection and Anonymize on input, Sensitive and NoRefusal on output), measure the latency and false-positive cost on your own traffic, and add scanners only as your threat model justifies the inference budget. Keep it as the runtime layer behind a pre-deployment scanner, not as your only line of defense. For more context on layering runtime guards with the rest of a defensive stack, aidefense.dev covers related strategies in depth.
Sources
- protectai/llm-guard (GitHub): Official repository. MIT-licensed, Python 3.9+. Full input- and output-scanner catalog and the scan_prompt/scan_output API.
- LLM Guard (Protect AI): Product page describing LLM Guard’s role as a self-hosted, in-infrastructure guardrail and its defense-in-depth positioning.
- LLM Guard documentation (Protect AI): Per-scanner documentation for both the input and output sides, including configuration and thresholds.