LLM Guard: Input and Output Scanning for Production LLM Apps

A practical breakdown of LLM Guard by Protect AI — its input and output scanners, how the sanitize/scan pipeline works, where it fits as a runtime guardrail, and its real limits.

By Best LLM Scanners Editorial · 8 min read

LLM Guard is an open-source, self-hosted security toolkit from Protect AI that sits in the request path of an LLM application and screens both directions: it scans user prompts before they reach the model, and it scans model responses before they reach the user. Unlike a pre-deployment scanner such as garak — which finds vulnerabilities before you ship — LLM Guard is a runtime guardrail. It runs on live traffic. That distinction governs everything about how you evaluate and deploy it.

LLM Guard is MIT-licensed, requires Python 3.9+, and runs entirely on your own infrastructure, so prompts and responses never leave your environment. For regulated workloads that is often the deciding factor: a hosted moderation API may not be an option at all.

The two-sided pipeline

LLM Guard splits its work into two scanner families that map onto the two halves of an LLM request.

Input scanners run on the user’s prompt before it reaches the model. The library ships a substantial set, including:

  • PromptInjection — detects prompt-injection attempts (the headline threat, OWASP LLM01)
  • Anonymize — redacts PII before it reaches the model
  • BanTopics, BanSubstrings, BanCompetitors, BanCode — policy filters for disallowed topics, strings, competitor mentions, or code
  • Toxicity, Sentiment — content classifiers on the inbound side
  • Secrets — catches API keys and credentials in user input
  • Gibberish, Language, InvisibleText, TokenLimit, Regex, Code — structural and formatting checks

Output scanners run on the model’s response before it ships to the user. This set is larger, including:

  • NoRefusal — detects whether the model refused (useful for measuring over-refusal)
  • Sensitive — flags sensitive data leakage in the output
  • Deanonymize — restores entities that Anonymize redacted on the way in
  • Toxicity, Bias, Gibberish, MaliciousURLs, FactualConsistency, Relevance — quality and safety classifiers
  • Code, JSON, Language, LanguageSame, Regex, ReadingTime, URLReachability, Sentiment — structural and policy checks

The asymmetry is deliberate. Input scanning faces adversarial obfuscation (the attacker controls the text); output scanning faces the model’s own behavior (a benign prompt can still yield a harmful, leaking, or off-topic response). The two need different scanners, and conflating them is a design mistake — a point we develop in the classifier-on-output pattern.

How a scan works

Each scanner takes the text and returns three things: a sanitized version of the text (some scanners modify it: Anonymize redacts, while others pass it through unchanged), a boolean for whether the text is valid, and a risk score. You compose scanners into a pipeline, which collects the validity flags and risk scores into dictionaries keyed by scanner name, and then you decide your policy: block on any failure, block above a score threshold, or sanitize-and-continue.
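The three policies can be sketched in a few lines. This is an illustration only, not part of LLM Guard: the `Policy` enum, the `should_block` helper, and the 0.8 default threshold are hypothetical names and values. The sketch assumes the pipeline hands back one dictionary of validity flags and one of risk scores, each keyed by scanner name.

```python
from enum import Enum


class Policy(Enum):
    # Hypothetical policy names for this sketch; not an LLM Guard API.
    BLOCK_ON_ANY_FAILURE = "block_on_any_failure"
    BLOCK_ABOVE_THRESHOLD = "block_above_threshold"
    SANITIZE_AND_CONTINUE = "sanitize_and_continue"


def should_block(policy, valid_by_scanner, score_by_scanner, threshold=0.8):
    """Return True if the request should be blocked under the given policy."""
    if policy is Policy.BLOCK_ON_ANY_FAILURE:
        # Block as soon as a single scanner marks the text invalid.
        return not all(valid_by_scanner.values())
    if policy is Policy.BLOCK_ABOVE_THRESHOLD:
        # Ignore the validity flags and block only on a high enough risk score.
        return any(score > threshold for score in score_by_scanner.values())
    # SANITIZE_AND_CONTINUE: never block; the caller forwards the sanitized text.
    return False
```

Under sanitize-and-continue the decision is always to proceed, so the sanitized text rather than the original must be what reaches the model.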

A minimal input-scan pipeline looks like:

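from llm_guard import scan_prompt
from llm_guard.input_scanners import Anonymize, PromptInjection, Toxicity
from llm_guard.vault import Vault

vault = Vault()  # Anonymize stores redacted entities here so Deanonymize can restore them
scanners = [Anonymize(vault), Toxicity(), PromptInjection()]
prompt = "Summarize the email from jane.doe@example.com."  # example input
sanitized_prompt, results_valid, results_score = scan_prompt(scanners, prompt)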

The output side mirrors it with scan_output, which additionally takes the original prompt so scanners like Relevance and Deanonymize have the context they need. Protect AI also ships LLM Guard as an API/container if you’d rather run it as a sidecar than import it as a library.
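To make the two-sided flow concrete, here is a minimal sketch of a request wrapper. It does not import LLM Guard. Instead it takes three callables, so the shape of the flow is visible and testable on its own. The names `guarded_call`, `scan_in`, `scan_out`, `call_model`, and `REFUSAL` are hypothetical, and the scan callables are assumed to return the same three-part result described above (sanitized text, validity flags, risk scores).

```python
from typing import Callable, Dict, Tuple

# (sanitized_text, valid_by_scanner, score_by_scanner)
ScanResult = Tuple[str, Dict[str, bool], Dict[str, float]]

REFUSAL = "Sorry, this request cannot be processed."  # placeholder message


def guarded_call(
    prompt: str,
    scan_in: Callable[[str], ScanResult],
    call_model: Callable[[str], str],
    scan_out: Callable[[str, str], ScanResult],
) -> str:
    """Screen the prompt, call the model, then screen the response."""
    sanitized_prompt, in_valid, _ = scan_in(prompt)
    if not all(in_valid.values()):
        return REFUSAL  # blocked before the model is ever called

    response = call_model(sanitized_prompt)

    # The output scan receives the prompt too, for context-dependent scanners.
    sanitized_response, out_valid, _ = scan_out(sanitized_prompt, response)
    if not all(out_valid.values()):
        return REFUSAL  # the model answered, but the answer is not shipped
    return sanitized_response
```

In a real deployment, `scan_in` and `scan_out` would be thin wrappers around `scan_prompt` and `scan_output` with your configured scanner lists. This sketch hard-codes a block-on-any-failure policy; a score threshold or sanitize-and-continue policy would replace the two `all(...)` checks.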

Where the latency goes

The thing teams underestimate is that several LLM Guard scanners are themselves ML models — PromptInjection, Toxicity, Bias, and others run transformer classifiers. That means LLM Guard adds real latency on every request, on top of the model call it protects, and the cost scales with how many model-backed scanners you enable. A pipeline with PromptInjection + Toxicity on input and Sensitive + Toxicity + FactualConsistency on output is several extra inferences per request.

Two practical consequences:

  1. Enable only the scanners your threat model needs. Every model-backed scanner adds latency to benign requests too, and those are the vast majority (>99%) of traffic. The cheap pattern scanners (Regex, BanSubstrings, Secrets) are nearly free; the classifiers are not.
  2. Benchmark the configured pipeline, not the published per-scanner numbers. The latency that matters is your composed pipeline at your concurrency, on your hardware. For methodology on measuring this honestly, aisecbench.com covers latency and cost as first-class benchmark metrics.
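A rough harness for the second point might look like the sketch below. It times any callable that takes a prompt, so you can pass in a wrapper around your configured pipeline. The `benchmark` function and its parameters are hypothetical names, the percentile uses the simple nearest-rank method, and the loop is sequential, so it measures single-request latency only and says nothing about behavior under concurrency.

```python
import math
import statistics
import time


def benchmark(pipeline, prompts, warmup=3):
    """Time pipeline(prompt) for each prompt; return latency stats in ms."""
    for prompt in prompts[:warmup]:
        pipeline(prompt)  # untimed warm-up, e.g. to trigger lazy model loading

    samples_ms = []
    for prompt in prompts:
        start = time.perf_counter()
        pipeline(prompt)
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    # Nearest-rank 95th percentile: the ceil(0.95 * n)-th smallest sample.
    p95_index = math.ceil(0.95 * len(samples_ms)) - 1
    return {
        "n": len(samples_ms),
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }
```

Feed it prompts sampled from real traffic rather than synthetic strings, since classifier latency depends on input length.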

Failure modes worth pre-empting

LLM Guard’s classifiers inherit the failure mode of every safety classifier: over-refusal on benign-but-adjacent inputs. A PromptInjection scanner tuned for high recall will flag legitimate prompts that happen to contain instruction-like phrasing. You catch this only by running your own eval set through the configured pipeline before shipping and re-running it whenever you change a scanner or its threshold. For a structured method to measure the false-positive cost and tune thresholds without breaking your detection rate, see False Positive Cost in Production Refusal Systems: How to Measure and Tune.
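One way to run that check is a threshold sweep over a labeled eval set, sketched below. It assumes you have already run each eval prompt through the configured scanner and recorded its risk score next to a ground-truth label. The `sweep_thresholds` name and the `(risk_score, is_attack)` pair format are hypothetical choices for this example, and a score at or above the threshold is treated as flagged.

```python
def sweep_thresholds(scored_examples, thresholds):
    """scored_examples: list of (risk_score, is_attack) pairs from your eval set."""
    benign = [score for score, is_attack in scored_examples if not is_attack]
    attacks = [score for score, is_attack in scored_examples if is_attack]

    rows = []
    for threshold in thresholds:
        # Benign prompts flagged at this threshold are false positives.
        false_positives = sum(1 for score in benign if score >= threshold)
        # Attack prompts flagged at this threshold are true detections.
        detected = sum(1 for score in attacks if score >= threshold)
        rows.append({
            "threshold": threshold,
            "false_positive_rate": false_positives / len(benign) if benign else 0.0,
            "detection_rate": detected / len(attacks) if attacks else 0.0,
        })
    return rows
```

Reading the rows side by side shows what each step up in threshold buys in fewer false positives and what it costs in missed attacks.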

The PromptInjection scanner is also not a complete defense on its own — no single detector is. Protect AI’s own framing positions LLM Guard as one layer in defense-in-depth, not a silver bullet. Pair it with model-side safety tuning and, where the threat model warrants, a dedicated guardrail model.

Where LLM Guard fits in the stack

LLM Guard occupies the runtime input/output screening slot. It complements, rather than replaces, the rest of a defensive stack:

  • Pre-deployment scanning (garak, PyRIT) finds the vulnerabilities before you ship; LLM Guard catches live attempts after. See our garak walkthrough and PyRIT explainer.
  • Guardrail models (Llama Guard) and frameworks (NeMo Guardrails) overlap with LLM Guard on content classification but differ in shape — a model vs a scanner library vs a framework. Our Llama Guard vs NeMo vs OpenAI Moderation comparison maps the tradeoffs, and guardml.io covers the broader guardrail landscape.

For where LLM Guard sits among the full set of open-source and enterprise options, see Best LLM Security Scanners: Open-Source and Enterprise Compared.

Practical recommendation

Reach for LLM Guard when you want self-hosted, in-VPC input/output screening and you’d rather assemble a pipeline from a broad scanner library than adopt a single guardrail model or a programmable framework. Start with a minimal set — PromptInjection and Anonymize on input, Sensitive and NoRefusal on output — measure the latency and false-positive cost on your own traffic, and add scanners only as your threat model justifies the inference budget. Keep it as the runtime layer behind a pre-deployment scanner, not as your only line of defense. For more context on layering runtime guards with the rest of a defensive stack, aidefense.dev covers related strategies in depth.


Sources

  1. protectai/llm-guard — GitHub
  2. LLM Guard — Protect AI
  3. LLM Guard documentation (Protect AI)