Engineering claims with receipts. Every number on this page is verifiable from outside the company: each one links to the file, the commit, or the workflow that produced it.
Last updated: 2026-05-05. Source of truth: docs/defensibility-roadmap.md.
Doing diligence? The 15-minute walkthrough at /verify shows how to reproduce every claim below: clone, run, see the same numbers. No accounts needed.
Defensibility scoreboard
We track 14 binary success criteria for the “world's best engineering” claim. Each is verifiable from outside the company. Status:
- #1. 7+ day green-main streak. Status: Calendar. Passive accrual.
- #2. 0 lint warnings. Status: Partial. 825 → 288 (-65% this session). Multi-day finish.
- #3. < 100 silent skips. Status: Earned. 13 silent (was 154). Commit `5ca0cb5c`.
- #4. Critical-path --strict (95% lines / 90% branches). Status: Earned. api-handler 91.7%, crypto 100%, audit 94.1%. Commit `46994c35`.
- #5. Mutation score > 85% on 3 critical files. Status: Partial. Scope expanded 3 → 8 files in ratchet 19: crypto 96.55%, audit 89.66%, api-handler 44.89%, detection-engine 20.60%, rule-builder 58.17%, statistics 50.92% (was 6.89%), ml-classifier 45.69% (was 0%), guardrail-dsl 10.20% (was 0%). 2 are above the 85% bar; the rest are earned-then-enforce. Phase 2 (2026-05-06) added 124 direct unit tests across 3 files; lifts: statistics +44pp, ml-classifier +46pp, guardrail-dsl +10pp. The 3 → 8 expansion more than doubles the scope of mutation testing.
- #6. SOC 2 Type 1 attestation. Status: Blocked. Evidence engine live + gap analysis done; auditor engagement gated on funding, attestation target Q4 2026. See `docs/soc2-starter-pack.md`.
- #7. External synthetic uptime checks. Status: Partial. Workflow live + first green run. Earnable after 24h of probes. Commit `4dc1d1eb`.
- #8. Sustained weekly blog cadence (12 posts). Status: Calendar. Volume bar earned (12/12). Sustained-cadence test starts 2026-05-12 with post 13.
- #9. 3+ OSS packages with downloads. Status: Earned. 4 packages live: 444 downloads/week. evalguardai-{openai, anthropic, otel}, @evalguard/sdk.
- #10. Public head-to-head benchmarks. Status: Earned. 3 reproducible benchmarks: firewall-latency (p95=1.11ms), firewall-detection-quality (100%/100%/100% on 200-prompt corpus, doubled + sourdough FP closed 2026-05-06), NeMo Guardrails head-to-head (1st independent). Commit `8637e975`.
- #11. OpenAPI spec: 100% public-route coverage. Status: Earned. 310/311 routes documented. 1 allowlisted (404 catch-all). Commit `878476e8`.
- #12. 12+ engineering blog posts. Status: Earned. 12 published. /blog index.
- #13. 30+ ADRs in repo. Status: Earned. 37 ADRs. `docs/adr/`. ADR-0001 through ADR-0037.
- #14. SBOM + security.txt + vulnerability disclosure. Status: Earned. Daily syft + grype CycloneDX SBOM. RFC 9116 security.txt. Public disclosure policy.
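The mutation-score criterion combines two checks: a fixed bar (85%) and a ratchet that stops scores from sliding back. The sketch below illustrates both using the per-file scores quoted in criterion #5. The function names and the flat name-to-score mapping are assumptions made for illustration; the actual format of scripts/.mutation-score-baseline.json is not shown on this page.

```python
from typing import Dict, List

BAR = 85.0  # the bar from criterion #5

# Per-file mutation scores as quoted in the scoreboard above.
SCORES: Dict[str, float] = {
    "crypto": 96.55,
    "audit": 89.66,
    "api-handler": 44.89,
    "detection-engine": 20.60,
    "rule-builder": 58.17,
    "statistics": 50.92,
    "ml-classifier": 45.69,
    "guardrail-dsl": 10.20,
}


def above_bar(scores: Dict[str, float], bar: float = BAR) -> List[str]:
    """Return the files whose mutation score exceeds the bar, sorted by name."""
    return sorted(name for name, score in scores.items() if score > bar)


def ratchet_failures(baseline: Dict[str, float], current: Dict[str, float]) -> List[str]:
    """Return one message per file that regressed below its recorded baseline.

    Hypothetical ratchet rule: a score may rise or stay flat, never drop.
    An empty list means the ratchet passes.
    """
    failures = []
    for name, floor in baseline.items():
        score = current.get(name)
        if score is None:
            failures.append(f"{name}: missing from current run")
        elif score < floor:
            failures.append(f"{name}: {score:.2f}% is below baseline {floor:.2f}%")
    return failures


print(above_bar(SCORES))  # ['audit', 'crypto']
```

Under this rule a file below the bar still passes CI as long as it does not regress, which is one way to read "earned-then-enforce": lift the score first, then lock it in.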
The numbers
Sources: .github/workflows/ci.yml, docs/adr/, registry.npmjs.org, api.npmjs.org, apps/web/public/openapi.json, scripts/.mutation-score-baseline.json, scripts/.firewall-latency-baseline.json, session deltas (chip pass discipline), scripts/.skip-baseline.json, /blog.
Claims with receipts
Customer audits ask specific questions. Here are 8 of them with our specific answers and the file you can read to verify.
What we don't yet have
Honest gaps. Each is being addressed; none are being hidden.
- api-handler.ts mutation score: 44.29% — the load-bearing API middleware has high line + branch coverage (97.1% / 91.7%) but mutation testing reveals 278 surviving mutants, mostly StringLiteral mutations on log messages and switch-case branch labels. Multi-day work to close. Tracked in scripts/.mutation-score-baseline.json.
- Firewall detection-QUALITY benchmark — we publish latency (1.13 ms p95) but not yet recall/precision against a public attack corpus (HarmBench / GarakAI). A fast firewall that misses attacks is worse than a slow one that catches them. Detection-quality benchmark is the next P4.3 work item.
- SOC 2 Type 1 attestation — gap analysis + control-to-TSC mapping done (leaning Drata), in docs/soc2-starter-pack.md, and the evidence engine is live; the auditor engagement is gated on funding. Attestation target Q4 2026. The /security page does not claim any SOC 2 status until the auditor letter is signed.
- External pentest — none commissioned yet. Planned post-Type-1 attestation using a HackerOne or Big-4 firm.
- Lint warnings: 288 remaining — down from 825 (-65%) across 3 chip passes this session. Multi-session per-file work to reach zero. Tracked in P1.2 task.
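The api-handler.ts gap above rests on the difference between coverage and mutation score. A minimal sketch of why the two can diverge, using a hypothetical `authorize` function rather than anything from the real middleware: a test suite can execute every line and branch while never asserting on a log message, so a StringLiteral mutant survives.

```python
from typing import Callable, List

Logger = Callable[[str], None]


def authorize(role: str, log: Logger) -> bool:
    """Original code: two branches, each logging a message."""
    if role == "admin":
        log("access granted")
        return True
    log("access denied")
    return False


def authorize_mutant(role: str, log: Logger) -> bool:
    """StringLiteral mutant: both log messages replaced by empty strings."""
    if role == "admin":
        log("")
        return True
    log("")
    return False


def weak_suite(fn: Callable[[str, Logger], bool]) -> bool:
    """Covers both branches but asserts only on return values."""
    sink: List[str] = []
    return fn("admin", sink.append) is True and fn("guest", sink.append) is False


def strong_suite(fn: Callable[[str, Logger], bool]) -> bool:
    """Same calls, but also asserts on what was logged."""
    sink: List[str] = []
    ok = fn("admin", sink.append) is True and fn("guest", sink.append) is False
    return ok and sink == ["access granted", "access denied"]


# The mutant passes the weak suite (it survives) and fails the strong one (it is killed).
print(weak_suite(authorize_mutant), strong_suite(authorize_mutant))  # True False
```

Both suites give 100% line and branch coverage of `authorize`; only the second one would raise its mutation score.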
Sources of truth
docs/defensibility-roadmap.md — the 14-criterion scoreboard, updated each round.
docs/adr/ — 34 ADRs covering BYOK encryption, cross-tenant defense, audit signing, ratchet discipline, etc.
.github/workflows/ci.yml — 17 active CI ratchets, each with deliberate-break verification (per ADR-0028).
benchmarks/ — public benchmarks with reproducible measurement scripts.
docs/soc2-starter-pack.md — SOC 2 Type 1 calendar, vendor comparison, control map.
Synthetic uptime probe history — public Actions runs, hourly probe of 3 production endpoints.
/blog — 12 engineering blog posts covering audit + ratchets + security postmortems.
Found a discrepancy between this page and the underlying receipt? Email security@evalguard.ai; we'd rather correct it than leave it.
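For readers unfamiliar with synthetic probes like the hourly one listed among the sources above, here is a minimal sketch of the idea using only the Python standard library. It is not the workflow's actual script: the function names are hypothetical, the URLs in the usage example are placeholders, and the pass rule (any 2xx status) is an assumption.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List, Tuple


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for a GET request, or 0 if the request never completed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx; the status code is still informative.
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar network errors.
        return 0


def probe(
    urls: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> Tuple[Dict[str, int], List[str]]:
    """Fetch each URL once; return all statuses plus the URLs that were not 2xx."""
    results = {url: fetch(url) for url in urls}
    failed = [url for url, status in results.items() if not 200 <= status < 300]
    return results, failed


# Usage with a stubbed fetcher, so the example runs without network access.
fake = {"https://example.com/health": 200, "https://example.com/api": 503}
results, failed = probe(fake, fetch=lambda url: fake[url])
print(failed)  # ['https://example.com/api']
```

Passing the fetcher in as a parameter keeps the pass/fail rule testable offline; a scheduled job would call `probe` with the default `http_status` and exit non-zero when `failed` is non-empty.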
On “world's best engineering”
We use a 14-criterion scoreboard rather than a marketing superlative because superlatives can't be verified. A “world's best” claim is worth what its receipts are worth. The number above (8 of 14 earned) is honest — externally checkable from this repo. We'd rather earn the claim line by line than assert it.
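The "8 of 14 earned" figure can be recomputed from the scoreboard itself. The sketch below transcribes the status of each criterion from the list at the top of this page and tallies them; the dictionary is a hand-copied illustration, not a file from the repo.

```python
from collections import Counter

# Status per criterion number, transcribed from the defensibility scoreboard.
SCOREBOARD = {
    1: "Calendar",
    2: "Partial",
    3: "Earned",
    4: "Earned",
    5: "Partial",
    6: "Blocked",
    7: "Partial",
    8: "Calendar",
    9: "Earned",
    10: "Earned",
    11: "Earned",
    12: "Earned",
    13: "Earned",
    14: "Earned",
}

counts = Counter(SCOREBOARD.values())
print(f"{counts['Earned']} of {len(SCOREBOARD)} earned")  # 8 of 14 earned
```

The remaining six split into 3 Partial, 2 Calendar, and 1 Blocked.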