Find every way your LLM agent can be jailbroken or leak data — before your customers do — then ship the compliance evidence to prove it. BYOK, self-hostable, no vendor lock-in.
Works with your entire stack
By the numbers
Built for high-stakes industries
Proof, not promises
More red-team coverage than any open-source tool — running behind a firewall whose p95 stays flat as traffic scales.
Built-in attack plugins. More vectors → more real findings, before your customers hit them.
A naive in-band scanner degrades as RPS climbs. EvalGuard's pre-filter holds ~2.57ms.
No infra. No vendor lock-in. The SDK runs in Node, edge, browser, and CI — same code path, same scorers, same results.
import { evaluate } from "@evalguard/core";

// `response` is your model's output for the input below.
const result = await evaluate({
  input: "What's the capital of France?",
  output: response,
  assertions: [
    { type: "answer-relevance", threshold: 0.8 },
    { type: "hallucination" },
    { type: "pii-leak" },
  ],
});

if (!result.passed) {
  console.log(result.failingCriteria);
}
Lifecycle
One workspace covers the full lifecycle — pre-launch evals, red-team scans, runtime guardrails, and audit-ready compliance.
01 — Evaluate
Run 188 built-in scorers across faithfulness, relevance, toxicity, and more. Create custom LLM-as-judge evaluators. Catch regressions before your users do.
02 — Secure
Automated adversarial testing across 42 strategies, including prompt injection, jailbreak, and PII extraction. Generate OWASP LLM Top 10 compliance reports instantly.
03 — Debug
Visualize every step of your agent's reasoning chain. Detect infinite loops, identify tool call failures, and pinpoint where things went wrong.
04 — Monitor
Track latency, cost, and quality in real time. Set alerts on drift, spikes, and anomalies. Get notified before your users complain.
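The loop detection mentioned under Debug can be pictured as a check for the same tool call repeating with identical arguments. The sketch below is illustrative only: the `TraceStep` shape and the `findRepeatedCall` helper are hypothetical names, not EvalGuard's trace schema or API.

```typescript
// Hypothetical trace step shape; not EvalGuard's actual trace schema.
interface TraceStep {
  tool: string;
  args: Record<string, unknown>;
}

// Returns the step at which the same tool call (same tool, same serialized
// args) has occurred `maxRepeats` times in a row, or null if no such run exists.
// Note: JSON.stringify is key-order sensitive, so args must be built consistently.
function findRepeatedCall(steps: TraceStep[], maxRepeats = 3): TraceStep | null {
  let runLength = 0;
  let previousKey = "";
  for (const step of steps) {
    const key = `${step.tool}:${JSON.stringify(step.args)}`;
    runLength = key === previousKey ? runLength + 1 : 1;
    previousKey = key;
    if (runLength >= maxRepeats) return step;
  }
  return null;
}
```

A real detector would likely also catch longer cycles (A, B, A, B) rather than only back-to-back repeats; this version keeps the idea minimal.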
Receipts
Real attack coverage, real eval scoring, real evidence exports — not slides.
Quickstart
Three steps. No infrastructure to manage.
# Install the CLI globally
npm install -g @evalguard/cli

# Run an evaluation
evalguard eval evalguard.yaml \
  --model gpt-4o

# Add to CI/CD pipeline
evalguard gate --min-score 0.9
> All checks passed. Deploying...
Personas
Tailored workflows for every stakeholder in the AI pipeline.
Enterprise
Enterprise-grade security, compliance, and deployment options from day one.
FAQ
Most tools cover one layer — eval, security, or observability. EvalGuard unifies all six (eval, firewall, gateway, observability, red-team, compliance) on one platform so signals compose end-to-end. You don't stitch evals + Helicone + Promptfoo + a homegrown firewall; you run one workspace with one auth, one bill, one SLA. We also ship 5× the red-team coverage (249 attack plugins vs 50–60 for the nearest open-source tool) and the only adaptive multi-turn red team that productionizes UCB1 bandit attack-strategy selection.
Yes — 91 typed providers in the gateway today (OpenAI, Anthropic, Gemini, Bedrock, Azure, Mistral, Cohere, DeepSeek, xAI, Together, Replicate, OpenRouter, Groq, Perplexity, and more). Same SDK call shape across all of them. BYOK key vault, automatic failover, semantic caching, and cost tracking are uniform — you don't write provider-specific code.
Minutes. Run `pip install evalguardai` (or `npm install -g @evalguard/cli`), drop an `evalguard.yaml` in your repo, and run `evalguard eval --model gpt-4o`. The free tier has no credit card, no time limit. Add `evalguard gate --min-score 0.9` to your CI/CD pipeline and you have eval gates on every PR.
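To make the idea of a `--min-score` gate concrete, here is a minimal sketch of the comparison such a gate performs. It assumes the threshold applies to each scorer result individually; the page does not say whether the real CLI gates per scorer or on an aggregate. `EvalResult` and `checkGate` are hypothetical names, not part of the EvalGuard CLI or SDK.

```typescript
// Hypothetical result shape; the CLI's real report format is not shown on this page.
interface EvalResult {
  name: string;
  score: number; // assumed to be normalized to the 0..1 range
}

// Assumption: the gate fails if any single result falls below the threshold.
function checkGate(
  results: EvalResult[],
  minScore: number,
): { passed: boolean; failures: EvalResult[] } {
  const failures = results.filter((r) => r.score < minScore);
  return { passed: failures.length === 0, failures };
}

const outcome = checkGate(
  [
    { name: "answer-relevance", score: 0.93 },
    { name: "hallucination", score: 0.71 },
  ],
  0.9,
);

// In CI, a failing gate would typically exit non-zero so the pipeline stops.
console.log(
  outcome.passed
    ? "All checks passed."
    : `Gate failed: ${outcome.failures.map((f) => f.name).join(", ")}`,
);
```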
Yes, your data is protected, and no, prompts are not stored by default. All data is encrypted at rest (AES-256-GCM) and in transit (TLS 1.3). Prompts and completions are NOT stored unless you explicitly enable trace logging per project. When enabled, retention is configurable per org and append-only audit logs track every access. Enterprise plans support BYOK envelope encryption, VPC deployment, and full air-gapped mode.
The most generous free tier in AI evals: 50,000 traces/month (5× Helicone/Portkey, matches Langfuse), 100 red-team scans, all 249 attack plugins, all 188 scorers + custom, AI Gateway (route/cache/failover) — yes, free — pairwise eval, ELO leaderboard, annotation queues (50/mo), full 5-layer firewall, OTel observability, 30-day retention, unlimited projects, 3 team members. No credit card. Upgrade to Pro ($49/mo) for 500K traces, FinOps dashboard, and the Prompt IDE.
Point it at your model endpoint (OpenAI URL, Anthropic, your hosted model, anything). It runs 249 attack plugins across 42 strategies — prompt injection, jailbreak, PII extraction, data exfiltration, multi-turn crescendo, role-play escalation. The adaptive red-team uses UCB1 to focus on attacks that are actually working. Output is an OWASP LLM Top 10 + OWASP Agentic Top 10 compliance report plus per-finding evidence (request, response, attack vector, suggested fix).
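For readers unfamiliar with UCB1, the sketch below shows the textbook selection rule applied to attack strategies: try every strategy once, then pick the one with the highest mean reward plus an exploration bonus. This is the standard algorithm, not EvalGuard's implementation; `ArmStats`, `selectArm`, and `recordReward` are hypothetical names, and the reward definition (1 for a successful attack, 0 otherwise) is an assumption.

```typescript
// Per-strategy statistics. Assumed reward: 1 if the attack succeeded, else 0.
interface ArmStats {
  pulls: number;
  totalReward: number;
}

// UCB1: play every arm once, then choose the arm maximizing
//   meanReward + sqrt(2 * ln(totalPulls) / pulls).
function selectArm(arms: ArmStats[]): number {
  const untried = arms.findIndex((a) => a.pulls === 0);
  if (untried !== -1) return untried;

  const totalPulls = arms.reduce((sum, a) => sum + a.pulls, 0);
  let best = 0;
  let bestScore = -Infinity;
  arms.forEach((a, i) => {
    const mean = a.totalReward / a.pulls;
    const bonus = Math.sqrt((2 * Math.log(totalPulls)) / a.pulls);
    if (mean + bonus > bestScore) {
      bestScore = mean + bonus;
      best = i;
    }
  });
  return best;
}

// Update the chosen strategy's statistics after observing the attack outcome.
function recordReward(arms: ArmStats[], index: number, reward: number): void {
  const arm = arms[index];
  if (!arm) throw new RangeError(`no arm at index ${index}`);
  arm.pulls += 1;
  arm.totalReward += reward;
}
```

The exploration bonus shrinks as a strategy is tried more often, so the scan concentrates on strategies that keep succeeding while still occasionally revisiting rarely tried ones.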
Yes. Two paths: (1) Write a custom LLM-as-judge with any grading rubric — pass a prompt template, EvalGuard handles the eval-loop + scoring. (2) Drop in a TypeScript / Python function that takes (input, output, expected) and returns a score 0–1. Both run through the same eval engine, same reports, same CI gates as the 188 built-in scorers.
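As an illustration of the second path, a custom scorer in the (input, output, expected) shape could look like the sketch below. The function name and the token-overlap metric are illustrative choices, and how a scorer is registered with EvalGuard is not shown on this page, so that step is left out.

```typescript
// Custom scorer sketch: fraction of expected tokens that appear in the output.
// Follows the (input, output, expected) -> score in [0, 1] shape described above.
// `_input` is unused by this particular metric.
function tokenOverlapScorer(_input: string, output: string, expected: string): number {
  // Lowercase and split on non-word characters (ASCII-oriented tokenization).
  const tokenize = (s: string): Set<string> =>
    new Set(s.toLowerCase().split(/\W+/).filter((t) => t.length > 0));

  const expectedTokens = tokenize(expected);
  if (expectedTokens.size === 0) return 1; // nothing was required

  const outputTokens = tokenize(output);
  const hits = Array.from(expectedTokens).filter((t) => outputTokens.has(t)).length;
  return hits / expectedTokens.size;
}
```

Because the return value is already in the 0 to 1 range, a threshold such as 0.8 can be applied to it in the same way as to a built-in scorer.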
Yes — SOC 2 Type II architecture (audit in progress), GDPR with 7 EU residency regions, SSO/SAML/SCIM with Okta + Azure + Google, RBAC with 5 roles + WORM audit logs (7-year retention), VPC + on-prem deployment, named SRE + 1h SLA response, 99.95% uptime target. See /enterprise for the full feature list.
Receipts
Every feature is backed by comprehensive testing.
Compliance
EU AI Act risk classification, ISO 42001 statement-of-applicability, SOC 2 evidence collector, OWASP LLM/Agentic Top 10 — 33 frameworks mapped out of the box.
Early-access feedback
Anonymized real feedback from our beta circle.
“Swapped new OpenAI() for the wrapper and the rest of the pipeline didn’t change. Traces and cost-ledger started populating on their own, and the eval-on-response flag gave me a per-call quality score without me wiring anything up. Slotted into our pipeline in an afternoon; most observability bolt-ons eat a sprint.”
“Dropped the firewall inline in front of an agent stack we were piloting. It caught the model echoing back a customer email it shouldn’t have, before our review caught it. p95 stayed under 3ms. First inline guardrail I’ve benched that didn’t add a latency tier.”
Free forever: 50,000 traces/month, AI Gateway included. Pro plans start at $49/mo.
Start evaluating, securing, and monitoring your AI in production today.