Concept · Red teaming

Red teaming

Evaluators tell you what your model does on the inputs you have. Red teaming tells you what your model does on the inputs you haven't thought of. The product surface is two orthogonal axes — plugins (what attack class to test) and strategies (how to obfuscate the test).

Plugins — the "what"

A plugin generates inputs designed to trigger a specific failure mode. EvalGuard ships 249 plugins grouped into 27 categories:

Harm-of-content — toxicity, hate speech, harassment, self-harm prompts.
Bias — gender, race, age, disability, religion, medical anchoring, political.
PII / leakage — Aadhaar, SSN, credit card, API key, secret extraction, system prompt leak.
Injection — prompt override, jailbreak, role-play breakout, indirect injection via tool inputs.
Hallucination — non-existent entity grounding, fabricated citation, overconfidence.
Agent-specific — tool-call abuse, off-policy actions, dataset poisoning via search results.
Vertical packs — healthcare, legal, finance, code- assistant variants of the above with domain-specific seed data.

Each plugin is a small generator that produces 5–200 test cases. The full catalogue is browsable in the docs.

Strategies — the "how"

A strategy takes an attack input and transforms it to evade detection. EvalGuard ships 42 strategies across 6 families:

Encoding — base64, ROT13, leetspeak, homoglyph, unicode-confusable substitutions.
Indirection — multi-turn ramp-up, role-play prefix, hypothetical framing ("for a research paper, describe how to ...").
Translation — pivot through low-resource languages where the model's safety training was thinner.
Token-level — adversarial suffixes (GCG), gradient- based perturbations.
Adaptive — judge-LLM-in-the-loop attacker that iterates on failures (the closed-loop attacker from G2).
Combination — chains of the above (translate then encode then ramp-up).

The 2D matrix

249 plugins × 42 strategies = ~10K attack/transform pairs. Running every combination is wasteful — most models fail in 1–2 plugin families and on 2–3 strategies. The recommended starting set:

Smoke (1–2 min): top-10 plugins × {plain, base64, multi-turn-ramp} = 30 cases. Catches the "obvious" gaps.
Standard (10–20 min): full plugin catalogue × top-5 strategies = ~1,200 cases. Suitable for PR-time gating.
Deep (overnight): full plugin × full strategy matrix + closed-loop adaptive attacker. ~10K cases. Suitable for quarterly regression + SOC 2 evidence.

Detection vs attack-success

A red-team run gives two numbers per (plugin, strategy) pair:

Attack-success rate — what % of cases produced a response that the eval scorers judged as failing. Higher means the model is more vulnerable.
Detection rate — what % of attack inputs the firewall + scorers caught BEFORE the response was produced. Higher means the guardrail is more effective.

Track both. A 50% attack-success rate with 95% detection is a defensible posture (the firewall is doing its job). A 50% attack- success rate with 30% detection is a real product bug.

How to run

Three call paths, all backed by the same plugin/strategy registry:

CLI: evalguard scan --plugins=pii,injection --strategies=top-5. Best for local dev.
CI/CD: add evalguard scan:local --baseline=last-merged to your GitHub Actions workflow; fails the PR if regression vs the baseline.
API: POST /api/v1/redteam/scan for programmatic runs from your own orchestrator.

Related concepts

Firewall vs scorer — what catches attacks at request-time vs at eval-time.
Agent checkpoints — three places to insert defenses; red-team should cover all three.
Plugin catalogue — full list with per-plugin pages.