EvalGuard — AI Evaluation & Security Platform

Prerequisites

git
Node.js 20+ and pnpm
~150 MB free disk for the clone + package cache
No accounts, no API keys, no signups required

Clone the repo

~1 min

Claim

All claims on this site are verifiable from the public source. There is no closed-source branch.

Why this is the right test

If the source isn't public, every other claim on this page is unverifiable. The claims that follow live in the code at HEAD, not in a marketing deck.

Command (paste into your terminal)

git clone https://github.com/EvalGuardAi/evalguard.git
cd evalguard

Expected: ~80MB clone, 130+ commits in the past two days
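
The clone command does not print the commit count or size itself. A minimal sketch for checking both from inside the clone, assuming "the past two days" is measured on committer dates relative to the moment you run it. The helper names `recent_commit_count` and `clone_size` are illustrative and do not come from the repo.

```shell
# Count commits reachable from HEAD whose committer date falls inside the
# given window (default: the last 2 days). Helper name is illustrative.
recent_commit_count() {
  git rev-list --count --since="${1:-2 days ago}" HEAD
}

# On-disk size of the cloned history, for comparison with the ~80MB figure.
clone_size() {
  du -sh .git | cut -f1
}
```

Run `recent_commit_count` and `clone_size` from the `evalguard` directory. The commit count depends on when you clone, so treat the expected figure as a snapshot rather than a constant.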

Verify CI ratchets exist + count them

~1 min

Claim

We run 21 active CI ratchets that fail PRs on regressions: RLS coverage, cross-tenant `.eq`, mass-assignment, force-dynamic, gitleaks, OpenAPI completeness, mutation-score floors, firewall latency, synth-check freshness, chaos-test coverage, migration down-coverage, +10 more.

Why this is the right test

Most pre-seed companies run 0 or 1 hard CI gates. 21 ratchets, each with a deliberate-break test (ADR-0028), is a genuine engineering signal that survives PR review pressure.

Command (paste into your terminal)

grep -E "ratchet|cross-tenant|gitleaks|force-dynamic|mass-assignment|chaos-coverage|migration-down" .github/workflows/ci.yml | grep -vE "^[[:space:]]*#" | wc -l

Expected: 21+ matching lines
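
Note that this check counts matching lines, not distinct ratchets, so a ratchet mentioned on two lines is counted twice. A sketch that prints the matched lines for inspection as well as the count, skipping YAML comment lines even when they are indented. The keyword pattern is copied from the command in this step; the function names are hypothetical.

```shell
# Keyword pattern copied from the verification command in this step.
RATCHET_PATTERN='ratchet|cross-tenant|gitleaks|force-dynamic|mass-assignment|chaos-coverage|migration-down'

# Print non-comment lines of a workflow file (read on stdin) that mention a
# ratchet keyword. YAML comments may be indented, so leading whitespace
# before '#' is allowed for.
ratchet_lines() {
  grep -vE '^[[:space:]]*#' | grep -E "$RATCHET_PATTERN"
}

# Count those lines (tr strips the padding some wc implementations add).
ratchet_count() {
  ratchet_lines | wc -l | tr -d ' '
}
```

From the repo root, `ratchet_lines < .github/workflows/ci.yml` shows what is being counted and `ratchet_count < .github/workflows/ci.yml` gives the number to compare against the claim.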

Count ADRs

~2 min

Claim

35+ Architecture Decision Records cover every load-bearing decision: encryption, RLS, audit-log signing, BullMQ DLQ, mutation testing, detection-benchmarking discipline.

Why this is the right test

ADRs are the audit trail for *why* decisions were made. Without them, a security audit gets answered with 'I think Bob in 2023 chose this' — not defensible.

Command (paste into your terminal)

ls docs/adr/*.md | wc -l

Expected: 36+ files (35 ADRs + README.md)
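
Because the file count includes README.md, it is one higher than the ADR count. A small sketch that counts ADRs directly, assuming every Markdown file in `docs/adr/` other than README.md is an ADR; `count_adrs` is a hypothetical helper name.

```shell
# Count Markdown files in an ADR directory, excluding README.md, so the
# result is the number of ADRs rather than the number of files.
count_adrs() {
  find "${1:-docs/adr}" -maxdepth 1 -type f -name '*.md' ! -name 'README.md' | wc -l | tr -d ' '
}
```

Running `count_adrs` from the repo root should print 35 or more if the claim holds.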

Run the firewall latency benchmark

~3 min

Claim

Detection layer p95 < 5 ms, real CI-measured, regression-gated by `scripts/firewall-latency-ratchet.cjs`.

Why this is the right test

Inline firewalls go in the request hot-path. Latency is a deal-breaker for adoption. We publish the number, the script that produced it, and the CI gate that prevents regression.

Command (paste into your terminal)

pnpm install
npx tsx scripts/benchmark-firewall-latency.mjs

Expected: p95 < 5 ms, p50 ~1 ms
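
For readers who want to sanity-check the reported percentiles: p95 is the latency that 95% of samples fall at or below. The sketch below uses the nearest-rank method on a list of latency samples in milliseconds, one per line. It is an illustration only; how the benchmark script itself computes percentiles is not shown on this page and may differ (for example, by interpolating between samples).

```shell
# Nearest-rank percentile: sort the samples, then take the value at rank
# ceil(p/100 * N). Reads one number per line on stdin; p is the first argument.
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1          # no samples, nothing to report
      r = p * NR / 100
      k = int(r)
      if (k < r) k++               # round up to the next whole rank
      if (k < 1) k = 1
      print v[k]
    }'
}
```

For example, `printf '0.8\n1.1\n0.9\n4.2\n1.0\n' | percentile 95` prints the slowest of the five samples, which shows why a small sample size makes p95 sensitive to a single outlier.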

Run the firewall detection-quality benchmark

~2 min

Claim

100% recall, 100% precision, 100% F1 on a 100-prompt corpus (50 attacks across 7 categories, 50 benign queries). Reproducible via committed script.

Why this is the right test

Latency without detection-quality is 'I block nothing, fast.' This benchmark answers 'does the firewall actually catch attacks?' against an OWASP/AdvBench-derived corpus.

Command (paste into your terminal)

npx tsx scripts/benchmark-firewall-detection.mjs

Expected: Recall 100.00%, Precision 100.00%, F1 100.00%
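
The three metrics follow from three counts: attacks blocked (true positives), benign prompts blocked (false positives), and attacks missed (false negatives). A sketch of the arithmetic using the standard definitions; `detection_metrics` is a hypothetical helper, not the committed benchmark script.

```shell
# Print recall, precision and F1 as percentages from raw counts:
#   tp = attacks blocked, fp = benign prompts blocked, fn = attacks missed.
detection_metrics() {
  awk -v tp="$1" -v fp="$2" -v fn="$3" 'BEGIN {
    recall    = (tp + fn) ? tp / (tp + fn) : 0
    precision = (tp + fp) ? tp / (tp + fp) : 0
    f1 = (precision + recall) ? 2 * precision * recall / (precision + recall) : 0
    printf "Recall %.2f%%, Precision %.2f%%, F1 %.2f%%\n", recall * 100, precision * 100, f1 * 100
  }'
}
```

`detection_metrics 50 0 0` reproduces the expected line above. With 50 attacks in the corpus, a single missed attack (`detection_metrics 49 0 1`) lowers recall by two full points, to 98.00%, which is the resolution limit of a corpus this size.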

Verify OSS package downloads

~1 min

Claim

4 OSS packages on npm with weekly downloads: `evalguardai-openai`, `evalguardai-anthropic`, `evalguardai-otel`, `@evalguard/sdk`.

Why this is the right test

OSS adoption is independent third-party validation. Anyone can run `npm install` and see the package; download counts are publicly auditable.

Command (paste into your terminal)

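curl -s https://api.npmjs.org/downloads/point/last-week/evalguardai-openai; echo
curl -s https://api.npmjs.org/downloads/point/last-week/evalguardai-anthropic; echo
curl -s https://api.npmjs.org/downloads/point/last-week/evalguardai-otel; echo
curl -s https://api.npmjs.org/downloads/point/last-week/@evalguard/sdk; echo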

Expected: Weekly download counts for each package
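
npm serves download counts from its public downloads API at api.npmjs.org, which returns a small JSON object with a `downloads` field. A sketch that reduces each response to a single number, assuming `curl` is installed and the network is reachable; the helper names are hypothetical.

```shell
# Extract the numeric "downloads" field from a downloads-API JSON response
# read on stdin. A sed pattern is enough for this flat response; use a real
# JSON parser for anything more complex.
parse_downloads() {
  sed -n 's/.*"downloads"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Fetch last week's download count for one package (needs curl + network).
weekly_downloads() {
  curl -fsS "https://api.npmjs.org/downloads/point/last-week/$1" | parse_downloads
}

# Usage, covering all four packages named in the claim:
#   for pkg in evalguardai-openai evalguardai-anthropic evalguardai-otel @evalguard/sdk; do
#     printf '%s: %s\n' "$pkg" "$(weekly_downloads "$pkg")"
#   done
```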

Inspect the public synth-check history

~2 min

Claim

External synthetic uptime checks run hourly from GitHub Actions, with a public run history, and catch outages independently of our internal monitoring.

Why this is the right test

A status page that depends on the system it's monitoring isn't a status page. GitHub Actions runs from outside our infra; the history is publicly auditable.

Command (paste into your terminal; `open` is the macOS launcher, use `xdg-open` on Linux)

open https://github.com/EvalGuardAi/evalguard/actions/workflows/synth-check.yml

Expected: Continuous successful runs, hourly cadence
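
The same history can be summarised from the terminal through GitHub's REST API, which lists runs for a workflow file. A sketch, assuming the repo is public and you stay within unauthenticated rate limits; `run_summary` is a hypothetical helper and the pattern match is a rough stand-in for a real JSON parser.

```shell
# Summarise run conclusions from a GitHub "list workflow runs" JSON response
# read on stdin, printing "<successes>/<completed runs>". Runs still in
# progress have a null conclusion and are not counted.
run_summary() {
  grep -oE '"conclusion":[[:space:]]*"[a-z_]+"' | awk '
    /"success"/ { ok++ }
    { total++ }
    END { printf "%d/%d\n", ok, total }'
}

# Usage (last 24 runs, roughly one day at an hourly cadence):
#   curl -fsS "https://api.github.com/repos/EvalGuardAi/evalguard/actions/workflows/synth-check.yml/runs?per_page=24" | run_summary
```

A result such as 24/24 is consistent with the claim; anything lower points at specific runs worth opening in the Actions history.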

Read the threat model

~2 min

Claim

17 threat classes documented in one place with mitigations + verifying artifacts + honest gaps.

Why this is the right test

A customer security audit asks 'how do you defend against X?' The threat model document is the rolled-up answer. Not having one means re-deriving the answer every time.

Command (paste into your terminal)

head -n 100 docs/threat-model.md

Expected: 17 threats listed, each with mitigations + receipts + gaps

Verify the SBOM is fresh

~1 min

Claim

Daily CycloneDX SBOM generated by syft and scanned for vulnerabilities by grype, with a public Actions history and an RFC 9116 security.txt.

Why this is the right test

Customer security questionnaires ask for SBOM. Generating one daily means the answer is 'here's today's, ask for any historical day' — not a 6-week project.

Command (paste into your terminal; `open` is the macOS launcher, use `xdg-open` on Linux)

open https://evalguard.ai/.well-known/security.txt

Expected: RFC 9116 security.txt with disclosure policy + PGP key
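
RFC 9116 makes two fields mandatory in a security.txt file: `Contact` and `Expires`. A sketch that checks for both in a file read from standard input; `check_security_txt` is a hypothetical helper, and it only tests that the fields are present, not that the `Expires` date is still in the future.

```shell
# Check a security.txt (read on stdin) for the two fields RFC 9116 makes
# mandatory: Contact and Expires. Field names are case-insensitive.
# Exits 0 when both are present, 1 otherwise, and names what is missing.
check_security_txt() {
  awk '
    tolower($0) ~ /^contact:/ { contact = 1 }
    tolower($0) ~ /^expires:/ { expires = 1 }
    END {
      if (!contact) print "missing Contact"
      if (!expires) print "missing Expires"
      exit !(contact && expires)
    }'
}

# Usage (needs curl + network):
#   curl -fsS https://evalguard.ai/.well-known/security.txt | check_security_txt && echo ok
```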

If every step passed

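You have personally verified every claim below that is marked Earned and has a step on this page. None of this required trust: every number came out of code you ran. The scoreboard also lists what is only partially earned and what is still scheduled.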

21 active CI ratchets: Earned
35 ADRs in repo: Earned
Firewall p95 < 5 ms (real CI): Earned
Firewall detection 100% on 100-prompt corpus: Earned
Head-to-head vs NeMo Guardrails (independent): Earned
OpenAPI coverage 99.7% (310/311 routes): Earned
Mutation testing on 8 critical-path files (5 ratcheted): Partial
Daily SBOM + vulnerability disclosure: Earned
External hourly synthetic uptime checks: Earned
Documented threat model (17 classes): Earned
OSS packages with weekly downloads (4 published): Earned
SOC 2 Type 1 attestation (target Q4 2026, gated on funding): Calendar / post-funding
External pentest (post-funding): Calendar / post-funding
Detection corpus expansion to 500+ prompts: Calendar / post-funding

Honest about what we don't yet have

We list our gaps publicly because hiding them makes the positive claims less credible. Each item below has a roadmap committed to the repo:

SOC 2 Type 1: gap analysis and control-to-TSC mapping are done and the evidence engine is live; auditor engagement is gated on funding, with attestation targeted for Q4 2026.
External pentest: not done; planned after seed funding (budget $10-25k).
Bug bounty program: not started; security@evalguard.ai is open for reports and we keep a hall of fame.
api-handler.ts mutation score: 44.89% against an 85% target. Earn-then-enforce path documented in ADR-0034.
Detection corpus expansion: 100 prompts now, 500+ next, drawing on AdvBench / HarmBench / AISafetyLab.

Found a claim you can't verify?

That's a bug — file an issue and we'll fix the page (or the code). Diligence questions also go to barathzath@gmail.com.
