Claims with receipts · 2026-05-05

Engineering defensibility.

Every number on this page is verifiable from outside the company — link to the file, the commit, or the workflow that produced it.


Last updated: 2026-05-05. Source of truth: docs/defensibility-roadmap.md.

Doing diligence? The 15-minute walkthrough at /verify shows how to reproduce every claim below: clone, run, see the same numbers. No accounts needed.

Defensibility scoreboard

8 Earned · 3 Partial · 2 Calendar-bound · 1 Blocked

We track 14 binary success criteria for the “world's best engineering” claim. Each is verifiable from outside the company. Status:

  • #1. 7+ day green-main streak · Calendar
    Calendar: passive accrual.
  • #2. 0 lint warnings · Partial
    825 → 288 (-65% session). Multi-day finish.
  • #3. < 100 silent skips · Earned
    13 silent (was 154). Commit `5ca0cb5c`.
  • #4. Critical-path --strict (95% lines / 90% branches) · Earned
    api-handler 91.7%, crypto 100%, audit 94.1%. Commit `46994c35`.
  • #5. Mutation score > 85% on 3 critical files · Partial
    Scope expanded 3 → 8 files in ratchet 19. crypto 96.55%, audit 89.66%, api-handler 44.89%, detection-engine 20.60%, rule-builder 58.17%, statistics 50.92% (was 6.89%), ml-classifier 45.69% (was 0%), guardrail-dsl 10.20% (was 0%). 2 above the 85% bar; rest are earned-then-enforce. Phase 2 (2026-05-06) added 124 direct unit tests across 3 files; lifts: statistics +44pp / ml-classifier +46pp / guardrail-dsl +10pp. Doubles the scope of mutation testing.
  • #6. SOC 2 Type 1 attestation · Blocked
    Evidence engine live + gap analysis done; auditor engagement gated on funding, attestation target Q4 2026. See `docs/soc2-starter-pack.md`.
  • #7. External synthetic uptime checks · Partial
    Workflow live + first green run. Earnable after 24h of probes. Commit `4dc1d1eb`.
  • #8. Sustained weekly blog cadence (12 posts) · Calendar
    Volume bar earned (12/12). Sustained-cadence test starts 2026-05-12 with post 13.
  • #9. 3+ OSS packages with downloads · Earned
    4 packages live: 444 downloads/week. evalguardai-{openai, anthropic, otel}, @evalguard/sdk.
  • #10. Public head-to-head benchmarks · Earned
    3 reproducible benchmarks: firewall-latency (p95=1.11ms), firewall-detection-quality (100%/100%/100% on 200-prompt corpus, doubled + sourdough FP closed 2026-05-06), NeMo Guardrails head-to-head (1st independent). Commit `8637e975`.
  • #11. OpenAPI spec, 100% public-route coverage · Earned
    310/311 routes documented. 1 allowlisted (404 catch-all). Commit `878476e8`.
  • #12. 12+ engineering blog posts · Earned
    12 published. /blog index.
  • #13. 30+ ADRs in repo · Earned
    37 ADRs. `docs/adr/`. ADR-0001 through ADR-0037.
  • #14. SBOM + security.txt + vulnerability disclosure · Earned
    Daily syft + grype CycloneDX SBOM. RFC 9116 security.txt. Public disclosure policy.

The numbers

  • Active CI ratchets: 17
    Receipt: .github/workflows/ci.yml
  • Architecture Decision Records: 34
    Receipt: docs/adr/
  • OSS packages on npm: 4
    Receipt: registry.npmjs.org
  • Weekly OSS downloads: 444
    Receipt: api.npmjs.org
  • Public API routes documented: 310 / 311 (100%)
    Receipt: apps/web/public/openapi.json
  • Critical-path mutation score (avg): 76.83%
    Receipt: scripts/.mutation-score-baseline.json
  • Firewall p95 latency (real CI): 1.13 ms
    Receipt: scripts/.firewall-latency-baseline.json
  • Lint warnings (down from 825): 288
    Receipt: session deltas (chip pass discipline)
  • Silent (undocumented) test skips: 13
    Receipt: scripts/.skip-baseline.json
  • Engineering blog posts: 12
    Receipt: /blog

Claims with receipts

Customer audits ask specific questions. Here are 8 of them with our specific answers and the file you can read to verify.

Q1. How is multi-tenant data isolated?
Three layers: Postgres RLS on every owned-resource table; an explicit `.eq("org_id", orgId)` chained onto every Supabase query (belt + suspenders); and a static-analysis CI ratchet that asserts every owned-table query carries the predicate.
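To make the third layer concrete, here is a minimal sketch of what a static check for the `org_id` predicate could look like. It is not the actual ratchet: the table names in `OWNED_TABLES`, the function name `findUnscopedQueries`, and the regex-based approach are all assumptions for illustration (a production check would more likely walk the TypeScript AST).

```typescript
// Hypothetical list of owned-resource tables; the real list is not given on this page.
const OWNED_TABLES = new Set(["projects", "eval_runs"]);

interface Violation {
  table: string;
  line: number;
}

// Flags every `.from("<owned table>")` chain that never calls `.eq("org_id", ...)`.
// A chain is approximated as the text from `.from(` up to the next semicolon.
function findUnscopedQueries(source: string): Violation[] {
  const violations: Violation[] = [];
  const fromCall = /\.from\(\s*["'`]([\w-]+)["'`]\s*\)/g;
  let match: RegExpExecArray | null;
  while ((match = fromCall.exec(source)) !== null) {
    const table = match[1];
    if (!OWNED_TABLES.has(table)) continue;
    const end = source.indexOf(";", match.index);
    const chain = source.slice(match.index, end === -1 ? source.length : end);
    if (!/\.eq\(\s*["'`]org_id["'`]/.test(chain)) {
      // 1-based line number of the `.from(` call.
      const line = source.slice(0, match.index).split("\n").length;
      violations.push({ table, line });
    }
  }
  return violations;
}
```

A CI step built on this idea would run the function over every source file and fail the build when the returned list is non-empty.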
Q2. What's your encryption-at-rest story?
AES-256-GCM with a 12-byte random IV per encryption, GCM auth tag stored alongside ciphertext. Implementation 100% line + branch coverage. Key sourced from secrets manager, wiped from process memory after first load.
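As an illustration of the scheme described above, the following sketch uses Node's built-in `crypto` module. The function names and the `iv || tag || ciphertext` storage layout are assumptions; the page only states that the tag is stored alongside the ciphertext. The key must be exactly 32 bytes for AES-256.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const IV_BYTES = 12; // 96-bit IV, freshly random for every encryption
const TAG_BYTES = 16; // default GCM auth tag length in Node

// Returns iv || tag || ciphertext (assumed layout, for illustration only).
function encrypt(plaintext: string, key: Buffer): Buffer {
  const iv = randomBytes(IV_BYTES);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // only valid after final()
  return Buffer.concat([iv, tag, ciphertext]);
}

// Throws if the key is wrong or any byte of the blob was modified.
function decrypt(blob: Buffer, key: Buffer): string {
  const iv = blob.subarray(0, IV_BYTES);
  const tag = blob.subarray(IV_BYTES, IV_BYTES + TAG_BYTES);
  const ciphertext = blob.subarray(IV_BYTES + TAG_BYTES);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

Because the IV is random per call, encrypting the same plaintext twice yields different blobs, and the auth tag makes tampering fail loudly at `decipher.final()` rather than returning corrupted plaintext.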
Q3. How fast is the firewall?
p95 = 1.13 ms in real CI, p50 = 2.35 ms, against 4-layer scan (pattern + token + semantic + output). Public SLA p95 ≤ 50 ms. Regression-gate ratchet (Ratchet 17) fails CI on degradation > 1.5× baseline.
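The regression gate can be pictured roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the percentile uses the nearest-rank method (the page does not say which method the benchmark uses), and only the 1.5× threshold comes from the text above.

```typescript
// Nearest-rank percentile over raw latency samples (milliseconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Passes only while the measured p95 stays within maxRatio of the stored baseline.
function latencyGate(
  currentP95Ms: number,
  baselineP95Ms: number,
  maxRatio = 1.5,
): { pass: boolean; limitMs: number } {
  const limitMs = baselineP95Ms * maxRatio;
  return { pass: currentP95Ms <= limitMs, limitMs };
}
```

With the published 1.13 ms baseline, the limit works out to about 1.70 ms, so a run measuring 1.6 ms would pass and one measuring 1.8 ms would fail CI.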
Q4. What does your CI catch?
17 active ratchets. Hard zero: RLS coverage, force-dynamic, TODO/FIXME, `as any`, dynamic eval, mass-assignment. Hard validation: lint, security.txt freshness, counts-canonical match, --strict critical-path coverage, firewall latency. Lower-only: skip-count (silent: 13), OpenAPI coverage (0 missing), cross-tenant-eq (137 floor), coverage baseline.
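For readers unfamiliar with the term, a "lower-only" ratchet compares a current count against a committed baseline and only ever lets the baseline move down. The sketch below shows the idea; the type and function names are hypothetical and the real ratchets may report results differently.

```typescript
type RatchetResult =
  | { status: "fail"; message: string }
  | { status: "hold" }
  | { status: "tighten"; newBaseline: number };

// A count above the baseline fails CI; a count below it proposes a tighter baseline.
function lowerOnlyRatchet(current: number, baseline: number): RatchetResult {
  if (current > baseline) {
    return { status: "fail", message: `count rose from ${baseline} to ${current}` };
  }
  if (current < baseline) {
    return { status: "tighten", newBaseline: current };
  }
  return { status: "hold" };
}
```

Applied to the silent-skip count, a baseline of 13 would reject a change that introduces a 14th silent skip and would lower the baseline whenever one is removed.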
Q5. How do you prevent BOLA / IDOR?
Auth gate in createApiHandler middleware (97.1% line / 91.7% branch coverage). Cross-tenant `.eq("org_id")` on every owned-resource query (Layer 2 of ADR-0014). Static-analysis ratchet enforces zero regressions.
Q6. Audit log integrity?
Every audit row carries an entry_signature column — HMAC-SHA256 of the row's content fields signed with AUDIT_SIGNING_KEY. Verifier (cron + on-demand /api/v1/audit-logs/verify) recomputes and compares constant-time. Versioned column list. Detects insider tampering even from service-role-key compromise.
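A minimal sketch of row signing and verification along these lines, using Node's `crypto` module. The column names in `SIGNED_COLUMNS`, the JSON payload encoding, and the function names are assumptions; only HMAC-SHA256, the versioned column list, and the constant-time comparison come from the description above.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical versioned column list; the real columns are not listed on this page.
const SIGNED_COLUMNS: Record<number, readonly string[]> = {
  1: ["id", "org_id", "actor", "action", "created_at"],
};

// Signs the listed columns in a fixed order; a JSON array avoids delimiter ambiguity.
function signRow(fields: Record<string, string>, version: number, key: string): string {
  const columns = SIGNED_COLUMNS[version];
  if (!columns) throw new Error(`unknown signature version ${version}`);
  const payload = JSON.stringify([version, ...columns.map((c) => fields[c] ?? null)]);
  return createHmac("sha256", key).update(payload).digest("hex");
}

// Recomputes the signature and compares it in constant time.
function verifyRow(
  fields: Record<string, string>,
  version: number,
  signature: string,
  key: string,
): boolean {
  const expected = Buffer.from(signRow(fields, version, key), "hex");
  const actual = Buffer.from(signature, "hex");
  // timingSafeEqual throws on unequal lengths, so check the length first.
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}
```

Including the version in the signed payload means a new column can be added under version 2 without invalidating rows signed under version 1.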
Q7. How do you prevent SSRF in webhooks?
assertPublicUrl(url) helper. Blocklist: localhost, 127.0.0.1, 0.0.0.0, 169.254.169.254 (AWS IMDS), metadata.google.internal, metadata.azure.com. DNS resolution check for private IPv4 / IPv6 link-local + ULA. Protocol whitelist: http/https only. DNS-rebinding defense via re-resolution at fetch time.
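The checks listed above could be sketched as below. This is an illustrative reimplementation, not the project's helper: the injectable `resolve` parameter, the `isPrivateAddress` helper, and the exact private ranges covered are assumptions. It also does not itself implement the fetch-time re-resolution; a caller would run the check again (or pin the returned addresses) immediately before connecting.

```typescript
import { lookup } from "node:dns/promises";
import { isIP } from "node:net";

const BLOCKED_HOSTS = new Set([
  "localhost", "127.0.0.1", "0.0.0.0", "169.254.169.254",
  "metadata.google.internal", "metadata.azure.com",
]);

// True for loopback, RFC 1918, link-local, and ULA addresses; non-IP input is treated as unsafe.
function isPrivateAddress(ip: string): boolean {
  const lower = ip.toLowerCase();
  const mapped = lower.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/); // IPv4-mapped IPv6
  if (mapped) return isPrivateAddress(mapped[1]);
  const family = isIP(lower);
  if (family === 4) {
    const [a, b] = lower.split(".").map(Number);
    return a === 0 || a === 10 || a === 127 ||
      (a === 169 && b === 254) ||
      (a === 172 && b >= 16 && b <= 31) ||
      (a === 192 && b === 168);
  }
  if (family === 6) {
    return lower === "::" || lower === "::1" ||
      /^fe[89ab][0-9a-f]:/.test(lower) || // fe80::/10 link-local
      /^f[cd][0-9a-f]{2}:/.test(lower); // fc00::/7 unique local (ULA)
  }
  return true;
}

type Resolver = (host: string) => Promise<string[]>;

const defaultResolver: Resolver = async (host) =>
  (await lookup(host, { all: true })).map((record) => record.address);

// Resolves to the vetted addresses, or rejects if the URL is not safe to call.
async function assertPublicUrl(rawUrl: string, resolve: Resolver = defaultResolver): Promise<string[]> {
  const url = new URL(rawUrl);
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`protocol not allowed: ${url.protocol}`);
  }
  const host = url.hostname.replace(/^\[|\]$/g, "").toLowerCase(); // strip IPv6 brackets
  if (BLOCKED_HOSTS.has(host)) throw new Error(`blocked host: ${host}`);
  const addresses = isIP(host) ? [host] : await resolve(host);
  if (addresses.length === 0 || addresses.some(isPrivateAddress)) {
    throw new Error(`host resolves to a non-public address: ${host}`);
  }
  return addresses;
}
```

Passing the resolver as a parameter keeps the sketch testable without network access and makes it easy to repeat the same check at fetch time.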
Q8. Where do design decisions live?
/docs/adr/ — 34 numbered architecture decision records. Each captures status, date, tags, context (forcing function), decision with alternatives considered, consequences (what it makes easy/hard, review triggers, references). PR-reviewed.

What we don't yet have

Honest gaps. Each is being addressed; none are being hidden.

  • api-handler.ts mutation score: 44.29% — the load-bearing API middleware has high line + branch coverage (97.1% / 91.7%) but mutation testing reveals 278 surviving mutants, mostly StringLiteral mutations on log messages and switch-case branch labels. Multi-day work to close. Tracked in scripts/.mutation-score-baseline.json.
  • Firewall detection-QUALITY benchmark — we publish latency (1.13 ms p95) but not yet recall/precision against a public attack corpus (HarmBench / GarakAI). A fast firewall that misses attacks is worse than a slow one that catches them. Detection-quality benchmark is the next P4.3 work item.
  • SOC 2 Type 1 attestation — gap analysis + control-to-TSC mapping done (leaning Drata), in docs/soc2-starter-pack.md, and the evidence engine is live; the auditor engagement is gated on funding. Attestation target Q4 2026. The /security page does not claim any SOC 2 status until the auditor letter is signed.
  • External pentest — none commissioned yet. Planned post-Type-1 attestation using a HackerOne or Big-4 firm.
  • Lint warnings: 288 remaining — down from 825 (-65%) across 3 chip passes this session. Multi-session per-file work to reach zero. Tracked in P1.2 task.

Sources of truth

docs/defensibility-roadmap.md — the 14-criterion scoreboard, updated each round.

docs/adr/ — 34 ADRs covering BYOK encryption, cross-tenant defense, audit signing, ratchet discipline, etc.

.github/workflows/ci.yml — 17 active CI ratchets, each with deliberate-break verification (per ADR-0028).

benchmarks/ — public benchmarks with reproducible measurement scripts.

docs/soc2-starter-pack.md — SOC 2 Type 1 calendar, vendor comparison, control map.

Synthetic uptime probe history — public Actions runs, hourly probe of 3 production endpoints.

/blog — 12 engineering blog posts covering audit + ratchets + security postmortems.

Found a discrepancy between this page and the underlying receipt? security@evalguard.ai — we'd rather correct it than leave it.

On “world's best engineering”

We use a 14-criterion scoreboard rather than a marketing superlative because superlatives can't be verified. A “world's best” claim is worth what its receipts are worth. The number above (8 of 14 earned) is honest — externally checkable from this repo. We'd rather earn the claim line by line than assert it.