Automated LLM Red-Teaming in CI: garak vs PyRIT vs Promptfoo
Three open-source tools can gate your pipeline on LLM security findings — garak, PyRIT, and Promptfoo. A practitioner comparison of how each fits CI/CD, what it scans, and which to run where.
The point of automated LLM red-teaming is to fail a build when a model or prompt change introduces a security regression — before it ships, not after an incident. Three open-source tools can play that gate: NVIDIA’s garak, Microsoft’s PyRIT, and Promptfoo. They overlap enough to be confused and differ enough that the right answer is usually “more than one.” This guide compares how each fits a CI/CD pipeline, what it actually scans, and where each earns its place.
The CI gate, defined
A useful CI security gate has four properties: it runs unattended, it produces a machine-parseable result, it lets you set pass/fail thresholds, and it’s fast enough that engineers don’t route around it. Hold the three tools against that bar.
garak: the breadth-first probe scanner
garak (NVIDIA, Apache-2.0) is the closest thing LLM security has to a Nessus-style scanner: a large library of pre-built probes covering jailbreaks, prompt injection, toxicity, hallucination, data leakage, and encoding attacks, run with a single command. For CI, its strengths are exactly the ones that matter:
- One-command invocation. `python -m garak --target_type ... --probes ...` runs unattended with no code.
- Machine-parseable output. garak writes JSONL hit logs and reports plus an HTML report; a shell step can parse the JSONL, check whether any probe category exceeded a threshold, and fail the build.
- Targeted subsets. Running all probes takes hours; in CI you run a focused subset to bring scan time down to minutes, then run the full suite nightly.
garak is the natural always-on regression gate: pin a probe subset, set per-category failure thresholds, and run it on every model or system-prompt change. Our garak walkthrough covers the probe/detector/generator architecture in depth. Its limit for CI is that it’s a fixed-corpus scanner — strong on known attack classes, thinner on multi-turn and bespoke campaigns.
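The threshold check described above can be sketched in a few lines of Python. This is a minimal sketch, not garak's own tooling: it assumes the report JSONL contains summary rows tagged `entry_type: "eval"` with `probe`, `passed`, and `total` fields, so verify those names against the report your garak version writes. The category names and threshold values are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical per-category ceilings on the share of attempts that failed.
THRESHOLDS = {"promptinject": 0.05, "encoding": 0.10}
DEFAULT_THRESHOLD = 0.0  # categories not listed must pass every attempt


def failure_rates(jsonl_lines):
    """Aggregate the failed share of attempts per probe category."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Assumption: summary rows are tagged entry_type == "eval".
        if entry.get("entry_type") != "eval":
            continue
        # "encoding.InjectBase64" -> "encoding"
        category = entry["probe"].split(".")[0]
        passed[category] += entry["passed"]
        total[category] += entry["total"]
    return {c: 1 - passed[c] / total[c] for c in total if total[c]}


def gate(rates, thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Return the categories whose failure rate exceeds their threshold."""
    return sorted(c for c, r in rates.items() if r > thresholds.get(c, default))
```

In a pipeline step, you would open the report file, pass it to `failure_rates`, and exit non-zero when `gate` returns a non-empty list. Keeping the thresholds in a committed file makes every change to the bar show up in review.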
PyRIT: the programmable campaign
PyRIT (Microsoft, MIT) is a red-teaming SDK, not a one-command scanner. You compose targets, datasets, orchestrators, converters, and scorers in Python. That makes it more work to wire into CI, since there’s no single CLI gate out of the box, but it buys two things garak can’t:
- Multi-turn attack strategies (Crescendo, TAP, Skeleton Key) that carry conversation state, catching vulnerabilities a single-shot scanner structurally can’t.
- Custom scorers tied to your own policy, so the pass/fail signal reflects your definition of harm rather than a generic detector’s.
In CI, PyRIT fits as a scheduled campaign rather than a per-commit gate: a committed Python harness (orchestrators, datasets, scorers in version control) run on a schedule, emitting scores your pipeline thresholds against. Our PyRIT explainer covers the architecture. The cost is that you own the integration glue; the benefit is depth garak doesn’t reach.
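To make the shape of such a harness concrete, here is a plain-Python illustration of a multi-turn loop with a policy-specific scorer and a pipeline threshold. It deliberately does not use PyRIT's API: `Target`, `Scorer`, `run_multi_turn`, and `campaign_passes` are hypothetical names, and a real harness would replace them with PyRIT's own targets, orchestrators, and scorers.

```python
from dataclasses import dataclass
from typing import Callable, List

Target = Callable[[List[str]], str]  # conversation so far -> model reply
Scorer = Callable[[str], float]      # reply -> harm score in [0, 1]


@dataclass
class CampaignResult:
    objective: str
    turns_used: int
    max_score: float


def run_multi_turn(objective: str, escalation: List[str], target: Target,
                   scorer: Scorer, stop_at: float = 0.8) -> CampaignResult:
    """Send escalating prompts in one conversation, scoring every reply."""
    history: List[str] = []
    max_score = 0.0
    for prompt in escalation:
        history.append(prompt)
        reply = target(history)  # the target sees the whole conversation
        history.append(reply)
        max_score = max(max_score, scorer(reply))
        if max_score >= stop_at:  # attack succeeded, stop escalating
            break
    return CampaignResult(objective, len(history) // 2, max_score)


def campaign_passes(results: List[CampaignResult], max_allowed: float) -> bool:
    """Pipeline threshold: no objective may score above max_allowed."""
    return all(r.max_score <= max_allowed for r in results)
```

The point of the sketch is the state carried in `history`: a target that refuses the first two prompts but yields on the third is invisible to a single-shot probe and visible here.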
Promptfoo: red-teaming built for the pipeline
Promptfoo (MIT-licensed; the project is now part of OpenAI and remains open source) was designed CI-first. It started as a prompt/RAG evaluation tool and grew a red-team mode that auto-generates adversarial prompts using a large library of attack plugins (prompt injection, jailbreaks, PII leakage, excessive agency, and many more), driven by a declarative YAML config rather than code.
For CI specifically, Promptfoo is the most turnkey of the three:
- Declarative config. The whole eval/red-team run is a YAML file you commit — no harness code.
- First-class CI/CD integration, including a GitHub Action for red-team scanning, so a failing finding blocks a pull request directly.
- Compliance mappings. Its red-team presets map to the OWASP LLM Top 10, NIST AI RMF, and MITRE ATLAS, which is useful when the gate has to produce an auditable report, not just a pass/fail.
Promptfoo’s sweet spot is teams that want red-teaming wired into pull-request CI with minimal custom code and a compliance-flavored report at the end.
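On CI systems where the GitHub Action is not an option, one way to get the same blocking behavior is to export the run as JSON and gate on the failure count. The sketch below assumes the export places its counts under `results.stats` with `failures` and `errors` keys; check that layout against the output of your Promptfoo version. The function name `promptfoo_gate` is hypothetical.

```python
import json


def promptfoo_gate(output_json: str, max_failures: int = 0) -> bool:
    """Return True when the run stays within the allowed number of failures."""
    data = json.loads(output_json)
    # Assumption: counts are reported under results.stats in the JSON export.
    stats = data["results"]["stats"]
    # Count errored test cases too, so a broken run cannot pass silently.
    failed = stats["failures"] + stats.get("errors", 0)
    return failed <= max_failures
```

A pipeline step would read the exported file, call `promptfoo_gate` on its contents, and exit non-zero when it returns `False`.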
Side-by-side for CI
| | garak | PyRIT | Promptfoo |
|---|---|---|---|
| Maintainer | NVIDIA | Microsoft | Promptfoo (part of OpenAI) |
| License | Apache-2.0 | MIT | MIT |
| Shape | Probe scanner (CLI) | Red-team SDK (Python) | Eval + red-team (declarative + CLI) |
| CI integration | JSONL parse + threshold | Custom harness, scheduled | Native GitHub Action |
| Multi-turn attacks | Limited | Strong (Crescendo, TAP) | Plugin-based |
| Config style | CLI flags / option files | Python code | YAML |
| Compliance reporting | Manual | Manual | OWASP / NIST / MITRE presets |
| Best CI role | Per-commit regression gate | Scheduled deep campaign | PR-blocking red-team gate |
They’re complementary, not exclusive
The mature pipeline uses more than one, mapped to where each is strong:
- Promptfoo or garak as the fast per-commit / per-PR gate — declarative and CI-native (Promptfoo) or one-command with threshold parsing (garak).
- garak full suite nightly, for breadth the per-commit subset skips.
- PyRIT on a schedule for the deep, multi-turn, custom-scored campaigns that need conversation state and your own policy.
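The tiering above can be written down as data, which also lets the pipeline refuse a plan that would make the per-commit gate slow. Every scan name, duration, and budget below is a hypothetical placeholder; measure your own runs before relying on the numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    name: str
    est_minutes: int  # hypothetical estimate, replace with measured times
    triggers: frozenset


SCANS = [
    Scan("promptfoo-redteam", 8, frozenset({"pull_request"})),
    Scan("garak-subset", 10, frozenset({"push"})),
    Scan("garak-full", 240, frozenset({"nightly"})),
    Scan("pyrit-campaign", 120, frozenset({"weekly"})),
]

# Hypothetical wall-clock budgets per pipeline trigger, in minutes.
BUDGET_MINUTES = {"pull_request": 20, "push": 20, "nightly": 480, "weekly": 480}


def plan(trigger: str) -> list:
    """Return the scans for a trigger, rejecting plans that exceed its budget."""
    selected = [s for s in SCANS if trigger in s.triggers]
    total = sum(s.est_minutes for s in selected)
    if total > BUDGET_MINUTES[trigger]:
        raise ValueError(f"{trigger} plan takes {total} min, over budget")
    return [s.name for s in selected]
```

The budget check is the useful part: moving the full garak suite onto the pull-request trigger fails loudly at planning time instead of quietly slowing every PR.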
All three are pre-deployment tools. None provides runtime protection; for live input/output screening you need a separate layer like LLM Guard or a guardrail model, covered in our guardrail selection guide and at guardml.io. And a gate is only as good as its thresholds: set per-category pass/fail bars deliberately, because they will otherwise be negotiated under pressure after an incident.
The threshold problem
A red-team CI gate fails builds, which means a badly tuned gate either lets regressions through (thresholds too loose) or blocks every release on noise (thresholds too tight). Two disciplines keep it honest:
- Pin the target model to a dated snapshot. A gate that runs against a floating model alias will flip pass/fail when the provider silently updates the model, and you’ll waste days chasing a “regression” you didn’t cause. aisecbench.com covers the reproducibility discipline these gates depend on.
- Set thresholds against measured false-positive cost. An attack-detection threshold that’s too aggressive blocks legitimate releases; our false-positive cost guide covers turning detection rates into a defensible bar.
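Both disciplines can be checked mechanically. The sketch below makes two assumptions that are not from the tools themselves: pinned model identifiers end in a date suffix (provider naming schemes vary, so adjust the pattern), and a reasonable starting bar is the mean failure rate of repeated runs against a known-good pinned model plus two standard deviations. That second rule is a simple noise heuristic standing in for a full false-positive cost analysis.

```python
import re
import statistics

# Assumption: pinned snapshots end in a date like -2024-08-06 or -20240806.
DATED_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model_id: str) -> bool:
    """Reject floating aliases that carry no dated snapshot suffix."""
    return bool(DATED_SUFFIX.search(model_id))


def calibrated_threshold(baseline_rates, sigmas: float = 2.0) -> float:
    """Set the bar above run-to-run noise measured on a known-good model.

    baseline_rates: failure rates from at least two repeated runs.
    """
    mean = statistics.mean(baseline_rates)
    spread = statistics.stdev(baseline_rates)
    return min(1.0, mean + sigmas * spread)
```

For example, baseline runs with failure rates of 0.02, 0.04, 0.03, and 0.03 yield a threshold of roughly 0.046, so a run at 0.04 passes while a run at 0.06 blocks the build.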
For the attack techniques all three tools generate, aisec.blog breaks down the mechanics.
Practical recommendation
If you want the fastest path to a PR-blocking gate with a compliance report, start with Promptfoo: declarative YAML and a native GitHub Action make it the lowest-friction CI option. If you’re already running garak, keep it as the per-commit regression gate (subset) plus a nightly full suite; its JSONL output parses cleanly into a threshold check. Add PyRIT as a scheduled deep campaign when your threat model needs multi-turn strategies or custom scoring tied to your policy. Pin the target to a dated snapshot in every case, and set per-category thresholds before they matter. For where these scanners sit in a complete stack, see Best LLM Security Scanners: Open-Source and Enterprise Compared, and aidefense.dev for surrounding defense strategy.
Sources
- NVIDIA/garak (GitHub): LLM vulnerability scanner. Apache-2.0. Single-command probe runs, JSONL/HTML output suitable for CI threshold parsing.
- microsoft/PyRIT (GitHub): Python Risk Identification Tool. MIT. Programmable orchestrators with multi-turn strategies (Crescendo, TAP, Skeleton Key) and custom scorers.
- promptfoo/promptfoo (GitHub): Test and red-team LLM apps. MIT-licensed; now part of OpenAI and remains open source. Declarative configs, CI/CD and GitHub Action integration, attack plugins with OWASP/NIST/MITRE mappings.