Best LLM Scanners
Lines of programming code on a screen, representing the garak vulnerability scanner CLI workflow
tools

Garak LLM Vulnerability Scanner: How It Works and When to Use It

A technical breakdown of the garak LLM vulnerability scanner — its probe architecture, supported attack categories, CLI workflow, and how it fits into a real AI red-teaming pipeline.

By Best LLM Scanners Editorial · · 8 min read

The garak LLM vulnerability scanner is the closest thing the industry has to a Nessus-style probe suite for language models. Where traditional network scanners enumerate CVEs against service banners, garak fires structured adversarial prompts at a target LLM, collects responses, and runs a battery of detectors to determine whether the model exhibited unsafe behavior. The result is a scored audit trail — JSONL logs plus an HTML report — that tells you which attack classes succeeded, at what rate, and against which inputs.

Garak stands for Generative AI Red-teaming & Assessment Kit. It is open-source (Apache 2.0), backed by NVIDIA, and described in a peer-reviewed framework paper by Derczynski et al. The current stable release as of May 2026 is v0.15.0. If you are evaluating, deploying, or auditing an LLM-powered system and haven’t run garak against it, you have a gap in your security posture.

Architecture: Generators, Probes, Detectors, Buffs

Garak’s design maps cleanly onto cybersecurity concepts most practitioners already know.

Generators are adapters that connect garak to a target model. The framework ships with generators for OpenAI-compatible APIs, Hugging Face model IDs, AWS Bedrock, Replicate, Cohere, Groq, NVIDIA NIM endpoints, local GGUF models, and any REST-accessible endpoint via a configurable JSON adapter. Pointing garak at a Databricks-hosted model, for example, requires only a JSON config file specifying the endpoint URL, authentication token, and the JSONPath expression for parsing the response.

Probes are the attack modules. Each probe encapsulates a specific threat category and ships with a corpus of adversarial prompts. Garak currently includes 37+ probe modules covering:

  • Jailbreaks: DAN (Do Anything Now) variants including DAN 6.0, 11.0, and AutoDAN; GCG (Greedy Coordinate Gradient) adversarial suffixes; PAIR and TAP attack strategies
  • Prompt injection: indirect injection via tool outputs, document context, and system-prompt override attempts
  • Toxicity: real toxicity prompts drawn from validated datasets; language model risk (LMRC) probes
  • Hallucination: snowballed hallucination chains; package hallucination (fabricated PyPI/npm dependency names that could enable supply chain attacks)
  • Data leakage: training data replay; membership inference attacks
  • Encoding-based attacks: Base64, ROT13, Zalgo text, and other obfuscation layers that can bypass safety filters
  • Malware generation: code-focused probes checking whether a model will generate functional exploit code
  • XSS and data exfiltration: testing for cross-site scripting in LLM-generated output, relevant to agentic systems that render HTML

Detectors analyze model outputs after each probe. Depending on the attack class, detectors use keyword matching, regex patterns, machine learning classifiers, or a separate LLM acting as a judge. Each probe has a primary_detector plus optional extended detectors. This layered approach reduces both false positives (a keyword match that isn’t actually harmful) and false negatives (harmful content that doesn’t trigger simple pattern rules).

Buffs are prompt transformation layers that run before probes are dispatched. Backtranslation, paraphrasing, and encoding variations act as fuzzing mechanisms — they increase attack surface coverage by testing whether safety measures are robust to surface-level reformulation of the same attack intent. A model that refuses a direct jailbreak but complies with a Base64-encoded version of the same prompt has a real vulnerability, and buffs surface that.

Running a Scan

Installation is a single pip command:

python -m pip install -U garak

A minimal scan against a local Hugging Face model looks like:

python -m garak --target_type huggingface --target_name gpt2 --probes dan.Dan_11_0

For production API models, you configure a generator in a JSON file and pass it via --generator_option_file. The --probes flag accepts comma-separated module names or all to run the full suite. A full scan can issue thousands of prompts — the corpus spans over 3,000 test cases across probe categories — so budget accordingly for API costs and rate limits.

Garak writes three output artifacts per run: a .jsonl hit log (every prompt, response, and detector verdict), a .jsonl report aggregating pass/fail rates, and an HTML report introduced in v0.14.0 that renders results in a browser-readable format. These artifacts make garak suitable for inclusion in CI/CD pipelines: a shell script can parse the JSONL report, check whether any probe exceeded a configurable failure threshold, and fail the pipeline accordingly.

For teams looking to compare garak’s output against established safety benchmarks, aisecbench.com tracks evaluation frameworks that can complement automated probe results with standardized scoring rubrics.

What the Results Actually Tell You

A Databricks engineering team ran garak against a hosted LLM and found that DAN-class jailbreaks succeeded on every single attempt across five repetitions — a 100% attack success rate against that specific model configuration. That finding is the kind of concrete, actionable signal garak is designed to surface. It does not mean the model is unusable; it means the deployment configuration needs a guardrail layer between the model and user input before it goes to production.

This is garak’s primary role: pre-deployment red teaming. It scans for exploitable behavior before you ship. It does not provide runtime protection. The distinction matters. For runtime defense — input screening, output filtering, real-time anomaly detection — you need a separate layer. guardml.io covers the guardrail and content-filtering tools that sit alongside a scanner like garak in a complete defensive stack.

For the attack techniques that garak probes are built around — the mechanics of DAN jailbreaks, indirect prompt injection, and GCG adversarial suffixes — aisec.blog provides operational breakdowns that help practitioners understand what garak is actually testing and why specific probe categories matter for their threat model.

Probe Coverage and Known Gaps

Garak is strong on known, documented attack classes. Its probe library maps well to the OWASP Top 10 for LLMs — particularly LLM01 (prompt injection), LLM02 (insecure output handling), LLM06 (sensitive information disclosure), and LLM09 (misinformation). The adaptive attack generation module (atkgen) can learn from successful probes and synthesize novel test cases, pushing garak slightly toward the red-teaming end of the spectrum rather than pure static-corpus scanning.

Coverage gaps are worth acknowledging. Garak does not yet have deep coverage for agentic multi-step attacks — scenarios where an LLM-powered agent takes sequential actions across tool calls. That gap reflects the broader state of the field; frameworks targeting agentic exploitation are still maturing. Garak also does not test model weights for backdoors or data-poisoning artifacts — that requires a different class of tooling.

For a narrower, code-generation-focused workflow, promptfoo is a faster alternative with tight CI integration. For teams that need multi-agent red teaming, Microsoft’s PyRIT covers some of the agentic territory garak currently leaves open. Garak’s advantage is breadth of classical LLM attack coverage and the quality of its peer-reviewed probe corpus. For a side-by-side comparison of garak, PyRIT, Promptfoo, LLM Guard, and runtime options as a complete defensive stack, see Best LLM Security Scanners: Open-Source and Enterprise Tools Compared.

Practical Recommendation

Run garak as a gate in your model evaluation pipeline, not a one-time audit. Models change when fine-tuned, when system prompts are updated, or when the underlying base model is upgraded. A probe suite that passes today can fail after a system prompt change. Automate the scan, version the JSONL reports alongside your model artifacts, and set explicit pass/fail thresholds per probe category before those thresholds are negotiated under pressure after an incident. For guidance on measuring and maintaining those thresholds over time — including how to build the eval set and convert false positive rates into business cost — see False Positive Cost in Production Refusal Systems: How to Measure and Tune.


Sources

Sources

  1. NVIDIA/garak: the LLM vulnerability scanner — GitHub
  2. garak: A Framework for Security Probing Large Language Models — arXiv
  3. AI Security in Action: Applying NVIDIA's Garak to LLMs on Databricks
  4. garak: LLM vulnerability scanner — official site
Subscribe

Best LLM Scanners — in your inbox

Comparing LLM security scanners and detection tools. — delivered when there's something worth your inbox.

No spam. Unsubscribe anytime.

Related

Comments