
Basic vs deep evaluation

EvalGuard ships two evaluator depths. They share the same dimensions (safety, fairness, accuracy, reliability, transparency, privacy, accountability, user-impact) but differ in cost, latency, and what they can detect. Picking the right one for the right traffic class is the single biggest knob on your eval bill.

Basic scorers — keyword + regex + ML classifier

Basic scorers run locally and make no LLM call, with sub-millisecond latency per dimension. The bias check is a learned classifier; toxicity uses Perspective API patterns; PII leakage is regex plus Aadhaar/PAN/UPI validators; reliability checks are structural (JSON-valid, regex-match, length-check). Cost in tokens: zero. Cost in dollars: zero beyond your own compute.
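To make the "regex + structural check" idea concrete, here is a minimal sketch of what two such checks could look like. The function names and the `BasicResult` shape are hypothetical and are not taken from the EvalGuard codebase; the PAN check validates format only (five letters, four digits, one letter) and does no checksum or issuer validation.

```typescript
// Hypothetical sketch of basic-style checks: pure functions, no network, no LLM.
// Names and result shape are illustrative assumptions, not EvalGuard's API.
type BasicResult = { pass: boolean; detail: string };

// PAN format: five uppercase letters, four digits, one uppercase letter.
const PAN_PATTERN = /\b[A-Z]{5}[0-9]{4}[A-Z]\b/;

// Flags output containing a PAN-shaped token (format match only).
function checkPanLeak(output: string): BasicResult {
  const match = PAN_PATTERN.exec(output);
  return match
    ? { pass: false, detail: `possible PAN at index ${match.index}` }
    : { pass: true, detail: "no PAN-shaped token found" };
}

// Structural reliability check: does the output parse as JSON?
function checkJsonValid(output: string): BasicResult {
  try {
    JSON.parse(output);
    return { pass: true, detail: "valid JSON" };
  } catch {
    return { pass: false, detail: "not valid JSON" };
  }
}

// Structural reliability check: is the output within a character budget?
function checkLength(output: string, maxChars: number): BasicResult {
  const ok = output.length <= maxChars;
  return { pass: ok, detail: `${output.length}/${maxChars} chars` };
}
```

Checks of this kind are cheap enough to run on every row, which is why they suit CI gating and large batch runs; the trade-off is that they only see surface patterns.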

Right for: CI-time gating, dataset-versioning checks, batch evals across millions of rows, anywhere you'd accept some false negatives in exchange for throughput.

Deep scorers — LLM-as-judge with per-criterion rubric

Each dimension's deep config (biasDeepConfig, toxicDeepConfig, faithfulnessDeepConfig, etc.) is a structured prompt that asks an LLM to grade against 3–7 explicit criteria. It returns a per-criterion pass/fail/partial verdict, an overall 0–1 score, and a one-sentence reasoning string. This catches nuanced bias the basic classifier misses, such as "female candidates often need extra support" (a textbook gendered claim).
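The result described above (per-criterion verdicts, an overall score, a reasoning string) can be sketched as a type plus a strict parser. The field names `criteria`, `criterion`, `verdict`, `score`, and `reasoning` are assumptions chosen for illustration; the actual deep-config output schema may differ.

```typescript
// Hypothetical result shape for a deep scorer; field names are assumptions.
type Verdict = "pass" | "fail" | "partial";

interface CriterionResult {
  criterion: string;
  verdict: Verdict;
}

interface DeepResult {
  criteria: CriterionResult[]; // one entry per rubric criterion
  score: number;               // overall score in [0, 1]
  reasoning: string;           // short justification from the judge
}

const VERDICTS: readonly string[] = ["pass", "fail", "partial"];

// Parses raw judge output and throws on anything malformed,
// instead of substituting a default score.
function parseDeepResult(raw: string): DeepResult {
  const { criteria, score, reasoning } = JSON.parse(raw) as Partial<DeepResult>;
  if (!Array.isArray(criteria) || criteria.length === 0) {
    throw new Error("missing criteria");
  }
  for (const c of criteria) {
    if (typeof c.criterion !== "string" || VERDICTS.indexOf(c.verdict) === -1) {
      throw new Error("malformed criterion entry");
    }
  }
  if (typeof score !== "number" || score < 0 || score > 1) {
    throw new Error("score must be a number in [0, 1]");
  }
  if (typeof reasoning !== "string") {
    throw new Error("missing reasoning");
  }
  return { criteria, score, reasoning };
}
```

Failing loudly on malformed judge output is a deliberate choice in this sketch: a parser that falls back to a default score can make a broken grader look like a passing one.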

Cost: one LLM call per dimension per evaluation. With gpt-4o-mini at the default 600-token cap, that's ~$0.0003 per dimension, so three dimensions (parity with the basic set) come to ~$0.001 per evaluation. Latency: 2–5 seconds per dimension; the calls are parallelisable.
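The arithmetic behind these figures is simple enough to sketch. The helper names below are hypothetical, and the per-dimension constant is just the approximate figure quoted above, not a live price.

```typescript
// Approximate per-dimension cost quoted in this page (gpt-4o-mini, 600-token cap).
const COST_PER_DIM_USD = 0.0003;

// Estimated deep-scoring cost: one LLM call per dimension per evaluation.
function deepCostUsd(dims: number, evaluations: number = 1): number {
  return dims * COST_PER_DIM_USD * evaluations;
}

// If per-dimension calls run concurrently, wall-clock latency is bounded
// by the slowest call rather than the sum of all calls.
function parallelLatencySec(perDimLatenciesSec: number[]): number {
  return perDimLatenciesSec.length === 0 ? 0 : Math.max(...perDimLatenciesSec);
}
```

Under these figures, three dimensions come to about $0.0009 per evaluation, or roughly $900 per million evaluations, and three parallel calls taking 2, 5, and 3 seconds finish in about 5 seconds rather than 10.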

Right for: production response gating, regulator-facing audits, any eval where a single false negative is more expensive than three LLM calls.

Picking a default

The POST /api/v1/evals/safe-regenerate endpoint takes a scorerSet field with values basic or deep. basic gates on safety + fairness + accuracy; deep adds reliability, transparency, privacy, accountability, user-impact (all 8 pillars).
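A minimal sketch of building such a request follows. Only the endpoint path and the `scorerSet` field come from this page; the `prompt` and `response` body fields, the helper name, and the base URL are illustrative assumptions, and authentication is omitted entirely.

```typescript
type ScorerSet = "basic" | "deep";

// Dimensions gated by each scorer set, as described above.
const DIMENSIONS: Record<ScorerSet, string[]> = {
  basic: ["safety", "fairness", "accuracy"],
  deep: [
    "safety", "fairness", "accuracy", "reliability",
    "transparency", "privacy", "accountability", "user-impact",
  ],
};

// Hypothetical body shape: only `scorerSet` is documented on this page.
interface SafeRegenerateBody {
  scorerSet: ScorerSet;
  prompt: string;   // assumed field name
  response: string; // assumed field name
}

interface BuiltRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds the URL and fetch-style init object without sending anything.
function buildSafeRegenerateRequest(baseUrl: string, body: SafeRegenerateBody): BuiltRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/api/v1/evals/safe-regenerate`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}
```

The returned `url` and `init` could then be passed to `fetch(url, init)` once the real body schema and auth headers are filled in from the API reference.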

For new integrations, start on deep and tune down only if measurement shows the cost matters. The gateway's basic firewall already does the cheap pre-filter inline, so by the time a response reaches safe-regenerate you've already paid the rendering cost. The extra ~$0.001 per call buys real bias detection.

What this maps to under the hood

  • Basic scorers live in packages/core/src/scorers/. 188 slugs, deep-linked from the scorer catalog.
  • Deep scorers live in packages/core/src/security/graders/deep/ and packages/core/src/scorers/deep/: 10 graders + 30 deep configs. They are imported from @evalguard/core at the package root (the re-export was added in commit 105b91c6, after live-fire testing surfaced the gap: every call was falsely returning score=1).
