Paste a candidate response below. We'll grade it against the same deep LLM-judge the production safe-regenerate endpoint uses — three pillars, per-criterion verdicts, structured reasoning. No signup, no API key, no rate-limit headers to debug.
Live evaluator · no signup
EvalGuard Score: 0.0 · Critical · FAIL @ 0.8
Safety: 0.0 (3845ms) · Fairness: 0.0 (3558ms) · Faithfulness: 0.0 (5782ms)
Effort: medium · gpt-4o-mini · Latency: 5791ms · Tokens: 4517 · Cost: <$0.001
Summary: Below threshold on: safety (0.00), fairness (0.00), faithfulness (0.00).
Per-criterion breakdown
Behind the curtain
The widget is the same code path our production customers run. Here's what shipping this in your stack looks like.
The grader
Each dimension is a structured prompt that asks the judge to grade against 3–7 explicit criteria, returning a per-criterion verdict, a 0–1 score, and a sentence of reasoning. The configs are imported as biasDeepConfig, toxicDeepConfig, and faithfulnessDeepConfig from @evalguard/core.
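As a rough illustration of the result shape described above, here is a minimal sketch. Every name in it (CriterionResult, DimensionResult, aggregate) is hypothetical and not taken from @evalguard/core. Attaching a score to each criterion and averaging the scores into a dimension score are assumptions made for the example, not documented behavior.

```typescript
// Hypothetical shapes for illustration only; not the actual @evalguard/core types.
type Verdict = "pass" | "fail";
type DimensionName = "safety" | "fairness" | "faithfulness";

interface CriterionResult {
  criterion: string; // the explicit criterion the judge graded against
  verdict: Verdict;  // per-criterion verdict
  score: number;     // 0–1 (assumed to be reported per criterion)
  reasoning: string; // one sentence from the judge
}

interface DimensionResult {
  dimension: DimensionName;
  criteria: CriterionResult[];
  score: number;
}

// Assumption: the dimension score is the unweighted mean of its criterion scores.
function aggregate(dimension: DimensionName, criteria: CriterionResult[]): DimensionResult {
  if (criteria.length < 3 || criteria.length > 7) {
    throw new Error("expected 3 to 7 criteria per dimension");
  }
  for (const c of criteria) {
    if (c.score < 0 || c.score > 1) {
      throw new Error(`score out of range for criterion: ${c.criterion}`);
    }
  }
  const total = criteria.reduce((sum, c) => sum + c.score, 0);
  return { dimension, criteria, score: total / criteria.length };
}
```

Keeping the per-criterion rows next to the aggregate means a failing score can always be traced back to the criteria that caused it.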
Concept: evaluation modes
The endpoint
This is the real production endpoint. On top of the demo it adds BYOK provider keys (Anthropic / Gemini / 89 others), cost-budget gating (HTTP 402 if over budget), a regen loop, an audit row in safe_regenerate_runs, a ledger entry, and policy engine hooks.
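To make the budget gate and regen loop concrete, here is a minimal sketch of how the two could interact. It is not the endpoint's implementation: safeRegenerate, RegenOptions, and RegenOutcome are hypothetical names, the callbacks are synchronous for brevity (real provider calls would be async), and costs are tracked as integer micro-dollars to avoid floating-point drift.

```typescript
// Hypothetical sketch; names and flow are assumptions, not the production endpoint.
interface RegenOptions {
  budgetMicroUsd: number;         // total spend allowed for this request
  costPerAttemptMicroUsd: number; // assumed flat cost per generate-and-grade attempt
  threshold: number;              // minimum passing score, e.g. 0.8
  maxAttempts: number;
}

interface RegenOutcome {
  status: 200 | 402;
  attempts: number;
  passed: boolean;
  response: string | undefined;
  spentMicroUsd: number;
}

function safeRegenerate(
  generate: (attempt: number) => string,
  grade: (candidate: string) => number,
  opts: RegenOptions,
): RegenOutcome {
  let spent = 0;
  let last: string | undefined;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    // Budget gate: refuse before spending if this attempt would exceed the budget.
    if (spent + opts.costPerAttemptMicroUsd > opts.budgetMicroUsd) {
      return { status: 402, attempts: attempt - 1, passed: false, response: last, spentMicroUsd: spent };
    }
    spent += opts.costPerAttemptMicroUsd;
    last = generate(attempt);
    if (grade(last) >= opts.threshold) {
      return { status: 200, attempts: attempt, passed: true, response: last, spentMicroUsd: spent };
    }
  }
  // Attempts exhausted without a passing candidate.
  return { status: 200, attempts: opts.maxAttempts, passed: false, response: last, spentMicroUsd: spent };
}
```

Checking the budget before each attempt, rather than after, means a request never overspends by one regeneration.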
API reference
What's different in production
This demo gates on safety, fairness, and faithfulness (accuracy). Production adds reliability, transparency, privacy, accountability, and user-impact, plus an inline firewall (2.57ms p95) that pre-filters keyword-shaped attacks before the LLM judge ever fires. Total guardrail overhead: ~5ms.
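The idea of a cheap pre-filter in front of an expensive judge can be sketched as follows. The patterns, the gate function, and the GateResult type are all hypothetical and chosen for illustration; the real firewall's rules are not described on this page.

```typescript
// Hypothetical patterns for illustration; not the production firewall's rule set.
const BLOCK_PATTERNS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/i, // injection-shaped phrase
  /\b\d{3}-\d{2}-\d{4}\b/,                        // US SSN-shaped number
  /\bAKIA[0-9A-Z]{16}\b/,                         // AWS access key ID shape
];

type GateResult =
  | { blocked: true; reason: string }
  | { blocked: false; score: number };

// Runs the regex gate first and only calls the (slow, costly) judge if nothing matches.
function gate(text: string, judge: (t: string) => number): GateResult {
  for (const pattern of BLOCK_PATTERNS) {
    // No global flag on the patterns, so test() carries no state between calls.
    if (pattern.test(text)) {
      return { blocked: true, reason: `matched ${pattern.source}` };
    }
  }
  return { blocked: false, score: judge(text) };
}
```

Because the firewall only matches shapes, it complements the judge rather than replacing it: anything that passes the patterns still gets scored.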
Concept: firewall vs scorer
Calibration
The demo's 0.8 default is a general-purpose chat threshold. Healthcare tightens safety/accuracy to 0.9. Internal dev tooling loosens to 0.7. The eval call returns raw scores; the policy engine maps them to actions.
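A minimal sketch of mapping raw scores to actions under different thresholds might look like this. The profile names, the decide function, and the two-action model are hypothetical. It also assumes that "accuracy" corresponds to the faithfulness pillar, that healthcare leaves fairness at 0.8, and that internal dev tooling loosens all three pillars to 0.7.

```typescript
type Dimension = "safety" | "fairness" | "faithfulness";
type Scores = Record<Dimension, number>;
type ProfileName = "chat" | "healthcare" | "devTooling";

// Thresholds follow the figures in the text; how they spread across pillars is an assumption.
const PROFILES: Record<ProfileName, Scores> = {
  chat:       { safety: 0.8, fairness: 0.8, faithfulness: 0.8 },
  healthcare: { safety: 0.9, fairness: 0.8, faithfulness: 0.9 },
  devTooling: { safety: 0.7, fairness: 0.7, faithfulness: 0.7 },
};

const ORDER: Dimension[] = ["safety", "fairness", "faithfulness"];

interface Decision {
  action: "allow" | "block";
  failing: Dimension[];
  summary: string;
}

// Hypothetical policy step: the eval call supplies raw scores, this maps them to an action.
function decide(scores: Scores, profile: ProfileName): Decision {
  const thresholds = PROFILES[profile];
  const failing = ORDER.filter((d) => scores[d] < thresholds[d]);
  if (failing.length === 0) {
    return { action: "allow", failing, summary: "All dimensions at or above threshold." };
  }
  const parts = failing.map((d) => `${d} (${scores[d].toFixed(2)})`);
  return { action: "block", failing, summary: `Below threshold on: ${parts.join(", ")}.` };
}
```

Separating the raw scores from the decision lets the same evaluation result be judged differently per deployment without re-running the grader.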
Concept: scoring thresholds
Beyond inline scoring
Inline eval is one of six products. Most enterprise customers start with the firewall + compliance evidence, then layer in red-team + gateway as their AI surface grows.
249 attack plugins × 42 strategies. CLI, CI/CD, or API.
2.57ms p95 pre-LLM gate. PII, injection, secrets, DLP.
Wrap 91 providers with auth, cost, scoring, audit in one call.
OTLP-native, ClickHouse rollups, anomaly alerts, span search.
33 frameworks: SOC 2, ISO 42001, EU AI Act, DPDP, HIPAA, GDPR.
Free tier · BYOK from day 1 · self-hostable on Docker / K8s / Helm