Full-Stack LLM Agent Security

Test, red-team, and monitor LLM agents in production.

Find every way your LLM agent can be jailbroken or leak data — before your customers do — then ship the compliance evidence to prove it. BYOK, self-hostable, no vendor lock-in.

Eval Prompt IDE Firewall Gateway Observability FinOps Red Teaming Compliance

Try it free — no signup

Free plan — 50,000 traces/month. No credit card required.

SOC 2 evidence engine ISO 42001 mapping EU AI Act GDPR HIPAA-aligned

Control-evidence automated. Third-party audits begin Q4 2026.

red-team demo2.57ms p95 · inline

❯

Firewall

Judge

Verdict

inspecting…

Evidence log

simulated console · same rules as productionRun it on your agent

300+

attack plugins

200+

scorers

90+

providers

2.57ms

firewall p95

The inspection spine

One inline control plane for every LLM call.

Drop EvalGuard in front of any provider. Every request is evaluated, firewalled, routed, and traced — with zero changes to your model code.

PROMPT

every request in

FIREWALL

injection · PII · secrets

JUDGE

scored live

VERDICT

pass / blocked

● PASS

TRACE

OTel span out

PROMPT

every request in

FIREWALL

injection · PII · secrets

JUDGE

scored live

VERDICT

pass / blocked

● PASS

TRACE

OTel span out

…and resolves to one console

app.evalguard.ai/dashboard/eval

Last 30 days

Production eval — support-agent · prod

Live

+1.4%

0.0%

Pass rate

+0.03

0.00

Avg score

−128ms

0ms

p95 latency

−$0.18

$0.00

Cost / 1k

94%

Eval pass rate

247 / 263

+1.4%

Pass rate over time

30d90d

Scorer

Threshold

Score

Status

answer-relevance

0.80

0.94

Pass

hallucination

—

no leak

Pass

prompt-injection

0.50

0.04

Pass

context-faithfulness

0.80

0.78

Below

Firewall · live

last 60s

●prompt-injection0.94BLOCKED

●PII · email1×REDACTED

●jailbreak · DANt3RESISTED

●secret · aws-key2.1msBLOCKED

Agent trace

1.24s

agent.run

1.24s

retrieve

142ms

llm.chat

680ms

tool.search

320ms

Firewall · live

BLOCKED prompt-injection

2.41ms · 0.94 conf

Compliance

SOC 2 evidence collected

12 of 12 CC controls

Gateway

91 providers · 0 outages

12.4k req / sec

Works with your entire stack

One platform for every model you ship.

One platform

One provable layer over the controls you already run.

EvalGuard turns your evals, red-team results, runtime guardrails, and monitoring into one audit-ready evidence trail — the visibility and proof layer your risk and compliance teams actually need. Works on top of, or in place of, the controls you already run.

Eval

Score every model change before it ships.

200+ built-in scorers + custom graders
Datasets, versioning, and CI pass/fail gates
LLM-as-judge, embedding, exact-match, RAG

Explore Eval

evalguard · eval live

faithfulness0.94 ✓

context-precision0.91 ✓

regression gatePASS

Eval

Open & verifiable

Don't take our word for it — run the proof yourself.

Every claim below is independently checkable today. No vanity logos.

Open-source SDKs

Apache-2.0 packages on npm & PyPI — TypeScript, Python & Go. Read the published source and pin any version.

Browse on npm

Install in one line

npm i @evalguard/corepip install evalguardai

Quickstart

Reproducible benchmarks

2.57ms firewall p95 and the first independent NeMo-Guardrails head-to-head — methodology you can re-run.

See the numbers

Self-host & BYOK

Run the whole platform in your own VPC and bring your own model keys — no data leaves your boundary.

Self-hosting guide

Want to shape the roadmap and get white-glove onboarding?

Become a design partner

Proof, not promises

A 2.57ms-p95 firewall that stays flat under load.

A naive in-band scanner degrades as RPS climbs. EvalGuard's pre-filter holds its p95 — inline protection without a latency tier.

EvalGuard firewall (~2.57ms p95)Naive in-band scanner

Red-team coverage benchmarks vs Promptfoo, Garak, and PyRIT live on /features. Read the methodology.

5-minute setup

Three lines. Production-grade eval.

No infra. No vendor lock-in. The SDK runs in Node, edge, browser, and CI — same code path, same scorers, same results.

TypeScript-first SDK with full type inference
90+ providers via the unified gateway
Streams results — no batch-waiting
Edge-runtime compatible (Vercel + Cloudflare)

guard.ts

import { evaluate } from "@evalguard/core";

const result = await evaluate({
  input: "What's the capital of France?",
  output: response,
  assertions: [
    { type: "answer-relevance", threshold: 0.8 },
    { type: "hallucination" },
    { type: "pii-leak" },
  ],
});

if (!result.passed) {
  console.log(result.failingCriteria);
}

Run 200+ scorers in 3 lines

Lifecycle

From your first eval to the production firewall.

One workspace covers the full lifecycle — pre-launch evals, red-team scans, runtime guardrails, and audit-ready compliance.

01 — Evaluate

Test every prompt before it reaches production.

Run 200+ built-in scorers across faithfulness, relevance, toxicity, and more. Create custom LLM-as-judge evaluators. Catch regressions before your users do.

02 — Secure

Red team your AI with 300+ attack plugins.

Automated adversarial testing across 100+ strategies — prompt injection, jailbreak, PII extraction, and more. OWASP LLM Top 10 compliance reports instantly.

03 — Debug

Chrome DevTools for your AI agents.

Visualize every step of your agent's reasoning chain. Detect infinite loops, identify tool call failures, and pinpoint where things went wrong.

04 — Monitor

Real-time observability for every LLM call.

Track latency, cost, and quality in real time. Set alerts on drift, spikes, and anomalies. Get notified before your users complain.

Receipts

What auditors, security teams, and ML engineers actually ask for.

Real attack coverage, real eval scoring, real evidence exports — not slides.

Evaluation Engine

200+ built-in scorers, custom LLM-as-judge evaluators, A/B testing, and CI/CD integration.Try interactive demo

200+ scorers

Security Scanner

300+ attack plugins across 100+ strategies with OWASP LLM Top 10 compliance.

OWASP LLM Top 10

Agent Debugging

Full trace visualization with infinite loop detection and root cause analysis.

agent.run() → llm.chat() → tool.search()

LLM Firewall

Real-time content filtering at 2.57ms p95. Block prompt injections, PII leaks, and secrets before they reach your model.

2.57ms p95

Monitoring

Real-time dashboards for latency, cost, quality drift, and anomaly detection.

OTel-native

Compliance Evidence

Hash-verified evidence auto-collected and mapped to SOC 2, ISO 42001, and EU AI Act controls — exportable to your auditor via API.

Evidence engine live

AI Gateway

Route 90+ providers through one proxy — auth, cost tracking, scoring, and audit logging in a single call. Zero lock-in.

gateway · live1 API

requestroute gpt-4o→claudecache servedfirewall clean200 · 41ms

90+ providers · 1 API

ONLY ON EVALGUARD

NL→Eval Pipeline

Describe your AI app in plain English. EvalGuard's proprietary NL pipeline generates a complete evaluation suite — test cases, security scans, compliance checks — in seconds. Zero configuration.

$“a support agent that reads orders”

generated eval suite

faithfulnesspii-leakjailbreak+18 cases

App profile → Full eval suite

PRODUCTIZED

Adaptive Red Teaming

AI vs AI security testing. Multi-turn conversations with UCB1 bandit optimization, real-time resistance profiling, 8 industry-specific packs, and a persona library — productized for enterprise, not a research toolkit.

adaptive · multi-turnUCB1 bandit

Turn 1

resisted

Turn 2

resisted

Turn 3

breached

Turn 1 Resisted → Turn 3 Breached

Quickstart

Three commands: install, eval, gate.

No infrastructure to stand up. The third command fails your CI build when quality drops below your threshold.

01 — Install the CLI

terminal — npm

# Install the CLI globally
npm install -g @evalguard/cli

02 — Run your first evaluation

terminal — evalguard

# Run an evaluation
evalguard eval evalguard.yaml \
  --model gpt-4o

03 — Gate your CI

ci — quality gate

# Add to CI/CD pipeline
evalguard gate --threshold 0.9
> All checks passed. Deploying...

Personas

Built for your role.

Tailored workflows for every stakeholder in the AI pipeline.

For CISOs

Automated OWASP LLM Top 10 compliance
Real-time vulnerability dashboard
SOC 2 readiness & GDPR audit documentation
Policy enforcement across all AI endpoints

security posture · live

OWASP LLM Top 1010 / 10

Risk posturelow

Open findings0

Security overview

For Engineering Leads

CI/CD quality gates for LLM outputs
Cost optimization with caching & routing
Team-wide evaluation dashboards
Incident root cause analysis

ci · PR #482merge ok

eval gatepassed

firewallclean

scorers4 / 4

Engineering workflow

For ML Engineers

200+ pre-built + custom evaluation metrics
A/B model comparison with confidence intervals
Trace-level debugging for agent chains
Dataset versioning with golden test sets

eval · gpt-4o-minipass

faithfulness0.94

answer-relevance0.91

toxicity0.02

ML workflow

Enterprise

Built for Enterprise.

Enterprise-grade security, compliance, and deployment options from day one.

SOC 2 control-evidence engine — live

Every SOC 2 control mapped to the feature that satisfies it, backed by hash-verified evidence you can export to your auditor today. Third-party audit process begins Q4 2026 — see /trust.

evidence · live12 / 12 CC controls

CC6.1Logical accessevidence

CC7.2Threat detectionevidence

CC8.1Change mgmtevidence

Controls + evidence live

GDPR Evidence Engine

Data-subject-request intake, atomic right-to-erasure, and consent gates wired — backed by exportable control evidence, with DPA templates and EU data-residency options.

EU data residency

SSO / SAML

SAML 2.0 + OIDC + SCIM with auto-provisioning, JIT role mapping, and break-glass owner override.

Okta · Azure · Google

Self-Hosted

Docker Compose for air-gapped deploys today. VPC deployment on AWS/GCP/Azure on the 2026 roadmap.

Docker · K8s

RBAC + Audit Log

Granular role-based access control with append-only audit logging — every action attributed and retained 7 years.

4 roles · WORM logs

Enterprise SLA

Enterprise tier with dedicated support, named SRE, and escalation paths to engineering on-call. See trust/sla for the full commitment.

99.97%

measured · 90d

41ms

gateway p50

P1 response

open incidents

99.95% target · 1h response

FAQ

Eight questions, answered straight.

How is EvalGuard different from other LLM evaluation tools?

Most tools cover one layer — eval, security, or observability. EvalGuard unifies eval, firewall, gateway, observability, red-team, and compliance into one provable, audit-ready evidence trail so signals compose end-to-end — whether it replaces those point tools or sits on top of controls you already trust. The win isn't fewer logos; it's one record your risk and compliance teams can actually verify. We also ship roughly 2× the red-team coverage (300+ attack plugins vs 125 for the nearest open-source tool, Promptfoo) and the only adaptive multi-turn red team that productionizes UCB1 bandit attack-strategy selection.

Can I use EvalGuard with any LLM provider?

Yes — 90+ typed providers in the gateway today (OpenAI, Anthropic, Gemini, Bedrock, Azure, Mistral, Cohere, DeepSeek, xAI, Together, Replicate, OpenRouter, Groq, Perplexity, and more). Same SDK call shape across all of them. BYOK key vault, automatic failover, semantic caching, and cost tracking are uniform — you don't write provider-specific code.

How long does it take to get started?

Minutes. Run `pip install evalguardai` (or `npm install -g @evalguard/cli`), drop an `evalguard.yaml` in your repo, and run `evalguard eval --model gpt-4o`. The free tier has no credit card, no time limit. Add `evalguard gate --threshold 0.9` to your CI/CD pipeline and you have eval gates on every PR.

Is my data secure? Do you store prompts and completions?

Yes and by default no. All data is encrypted at rest (AES-256-GCM) and in transit (TLS 1.3). Prompts and completions are NOT stored unless you explicitly enable trace logging per project. When enabled, retention is configurable per org and append-only audit logs track every access. Enterprise plans support BYOK envelope encryption, VPC deployment, and full air-gapped mode.

What does the free tier include?

The most generous free tier in AI evals: 50,000 traces/month (5× Helicone/Portkey, matches Langfuse), 100 red-team scans, all 300+ attack plugins, all 200+ scorers + custom, AI Gateway (route/cache/failover) — yes, free — pairwise eval, ELO leaderboard, annotation queues (50/mo), full 5-layer firewall, OTel observability, 30-day retention, unlimited projects. No credit card. Upgrade to Pro ($49/mo) for 500K traces, team collaboration, FinOps dashboard, and the Prompt IDE.

How does the security scanner work?

Point it at your model endpoint (OpenAI URL, Anthropic, your hosted model, anything). It runs 300+ attack plugins across 100+ strategies — prompt injection, jailbreak, PII extraction, data exfiltration, multi-turn crescendo, role-play escalation. The adaptive red-team uses UCB1 to focus on attacks that are actually working. Output is an OWASP LLM Top 10 + OWASP Agentic Top 10 compliance report plus per-finding evidence (request, response, attack vector, suggested fix).

Can I create custom evaluation metrics?

Yes. Two paths: (1) Write a custom LLM-as-judge with any grading rubric — pass a prompt template, EvalGuard handles the eval-loop + scoring. (2) Drop in a TypeScript / Python function that takes (input, output, expected) and returns a score 0–1. Both run through the same eval engine, same reports, same CI gates as the 200+ built-in scorers.

Do you offer enterprise features?

Yes — a live SOC 2 control-evidence engine (exportable to your auditor today; third-party audit process begins Q4 2026), GDPR with 2 EU data-residency regions (Ireland + Frankfurt), SSO/SAML/SCIM with Okta + Azure + Google, RBAC with 4 built-in roles + custom roles + WORM audit logs (7-year retention), VPC + on-prem deployment, named SRE + 1h SLA response, 99.95% uptime target. See /enterprise for the full feature list.

Compliance

Audit reports your CISO will sign.

EU AI Act risk classification, ISO 42001 statement-of-applicability, SOC 2 evidence collector, OWASP LLM/Agentic Top 10 — 50 frameworks mapped out of the box.

SOC 2

Evidence engine

ISO 42001

AI Mgmt System

EU AI Act

Annex IV

GDPR

EU residency

HIPAA

Aligned

OWASP LLM

Top 10

Early-access feedback

What testers noticed first.

Real feedback from our early-access circle.

“We swapped new OpenAI() for the wrapper and nothing else in the pipeline had to change. Traces and the cost ledger populated on their own, and the eval-on-response flag gave us a per-call quality score with no extra wiring. It slotted into our pipeline in an afternoon — most observability add-ons take us a full sprint.”

MLOps engineerEarly-access tester · quote on file

“We deployed the firewall inline in front of an agent stack we were piloting, and it caught the model echoing back a customer email it shouldn’t have — before our own review did. p95 stayed under 3ms; it’s the first inline guardrail I’ve benchmarked that didn’t add a latency tier.”

Security infrastructure leadEarly-access tester · quote on file

Ready to ship better AI?

Free forever — 50,000 traces/month, AI Gateway included.Pro plans start at $49/mo.

Start evaluating, securing, and monitoring your AI in production today.

Create your free workspace

View Pricing

No credit card requiredFree forever tierEnterprise-grade security