Engineering — EvalGuard

Engineering claims with receipts. Every number on this page is verifiable from outside the company and links to the file, the commit, or the workflow that produced it.

Last updated: 2026-05-05. Source of truth: docs/defensibility-roadmap.md.

Doing diligence? The 15-minute walkthrough at /verify shows how to reproduce every claim below: clone, run, see the same numbers. No accounts needed.

Defensibility scoreboard

Status key: Earned, Partial, Calendar-bound, Blocked.

We track 14 binary success criteria for the “world's best engineering” claim. Each is verifiable from outside the company. Status:

#1. 7+ day green-main streak [Calendar]
Calendar: passive accrual.
#2. 0 lint warnings [Partial]
825 → 288 (-65% this session). Multi-day finish.
#3. < 100 silent skips [Earned]
13 silent (was 154). Commit `5ca0cb5c`.
#4. Critical-path --strict (95% lines / 90% branches) [Earned]
api-handler 91.7%, crypto 100%, audit 94.1%. Commit `46994c35`.
#5. Mutation score > 85% on 3 critical files [Partial]
Scope expanded from 3 to 8 files in ratchet 19. crypto 96.55%, audit 89.66%, api-handler 44.89%, detection-engine 20.60%, rule-builder 58.17%, statistics 50.92% (was 6.89%), ml-classifier 45.69% (was 0%), guardrail-dsl 10.20% (was 0%). 2 of the 8 are above the 85% bar; the rest are earned-then-enforce. Phase 2 (2026-05-06) added 124 direct unit tests across 3 files; lifts: statistics +44pp, ml-classifier +46pp, guardrail-dsl +10pp. Doubles the scope of mutation testing.
#6. SOC 2 Type 1 attestation [Blocked]
Evidence engine live + gap analysis done; auditor engagement gated on funding, attestation target Q4 2026. See `docs/soc2-starter-pack.md`.
#7. External synthetic uptime checks [Partial]
Workflow live + first green run. Earnable after 24h of probes. Commit `4dc1d1eb`.
#8. Sustained weekly blog cadence (12 posts) [Calendar]
Volume bar earned (12/12). Sustained-cadence test starts 2026-05-12 with post 13.
#9. 3+ OSS packages with downloads [Earned]
4 packages live: 444 downloads/week. evalguardai-{openai, anthropic, otel}, @evalguard/sdk.
#10. Public head-to-head benchmarks [Earned]
3 reproducible benchmarks: firewall-latency (p95=1.11ms), firewall-detection-quality (100%/100%/100% on 200-prompt corpus, doubled + sourdough FP closed 2026-05-06), NeMo Guardrails head-to-head (1st independent). Commit `8637e975`.
#11. OpenAPI spec: 100% public-route coverage [Earned]
310/311 routes documented. 1 allowlisted (404 catch-all). Commit `878476e8`.
#12. 12+ engineering blog posts [Earned]
12 published. /blog index.
#13. 30+ ADRs in repo [Earned]
37 ADRs. `docs/adr/`. ADR-0001 through ADR-0037.
#14. SBOM + security.txt + vulnerability disclosure [Earned]
Daily syft + grype CycloneDX SBOM. RFC 9116 security.txt. Public disclosure policy.

The numbers

Active CI ratchets

Receipt: .github/workflows/ci.yml

Architecture Decision Records

Receipt: docs/adr/

OSS packages on npm

Receipt: registry.npmjs.org

444

Weekly OSS downloads

Receipt: api.npmjs.org

310 / 311 (100% excluding 1 allowlisted 404 catch-all)

Public API routes documented

Receipt: apps/web/public/openapi.json

76.83%

Critical-path mutation score (avg)

Receipt: scripts/.mutation-score-baseline.json

1.13 ms

Firewall p95 latency (real CI)

Receipt: scripts/.firewall-latency-baseline.json

288

Lint warnings (down from 825)

Receipt: session deltas — chip pass discipline

Silent (undocumented) test skips

Receipt: scripts/.skip-baseline.json

Engineering blog posts

Receipt: /blog

Claims with receipts

Customer audits ask specific questions. Here are 8 of them with our specific answers and the file you can read to verify.

Q1. How is multi-tenant data isolated?

Three layers: Postgres RLS on every owned-resource table; an explicit `.eq("org_id", orgId)` chained onto every Supabase query (belt and suspenders); and a static-analysis CI ratchet that asserts every owned-table query carries the predicate.

Receipt: ADR-0014 + scripts/cross-tenant-eq-check.cjs (lower-only floor at 137 chains)
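The static-analysis layer can be pictured with a minimal sketch. This is not the contents of scripts/cross-tenant-eq-check.cjs; the table names, the function name, and the statement-splitting heuristic are all assumptions made for illustration.

```typescript
// Hypothetical sketch of a tenant-predicate check (not the real ratchet script).
// Assumed owned-resource tables; the real list would come from the schema.
const OWNED_TABLES = new Set(["projects", "eval_runs"]);

// Returns the table name of every owned-table query that lacks an org_id predicate.
export function findUnscopedQueries(source: string): string[] {
  const violations: string[] = [];
  // Crude statement split on semicolons; a real check would walk the AST.
  for (const statement of source.split(";")) {
    const table = statement.match(/\.from\(\s*["'](\w+)["']\s*\)/)?.[1];
    if (!table || !OWNED_TABLES.has(table)) continue;
    const scoped = /\.eq\(\s*["']org_id["']\s*,/.test(statement);
    if (!scoped) violations.push(table);
  }
  return violations;
}
```

A CI step built on a check like this would fail the build whenever the returned list is non-empty, so an unscoped query never reaches main even though RLS would still block it at the database.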

Q2. What's your encryption-at-rest story?

AES-256-GCM with a 12-byte random IV per encryption and the GCM auth tag stored alongside the ciphertext. The implementation has 100% line and branch coverage. The key is sourced from a secrets manager and wiped from process memory after first load.

Receipt: ADR-0008 + apps/web/src/lib/crypto.ts (96.55% mutation score)
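For readers who want to see the shape of the scheme, here is a minimal sketch using Node's built-in crypto module. It is not the code in apps/web/src/lib/crypto.ts; the function names and the `Sealed` record layout are assumptions, and key loading from the secrets manager is left out.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical storage shape: IV and auth tag kept next to the ciphertext.
interface Sealed {
  iv: Buffer;
  tag: Buffer;
  ciphertext: Buffer;
}

// Encrypts with AES-256-GCM using a fresh 12-byte random IV per call.
export function encryptSecret(key: Buffer, plaintext: string): Sealed {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), ciphertext };
}

// Decrypts; final() throws if the ciphertext or tag was tampered with.
export function decryptSecret(key: Buffer, sealed: Sealed): string {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAuthTag(sealed.tag);
  return Buffer.concat([decipher.update(sealed.ciphertext), decipher.final()]).toString("utf8");
}
```

The key must be exactly 32 bytes. Because the IV is random per call, encrypting the same plaintext twice yields different ciphertexts, and any bit flip in the stored ciphertext causes decryption to throw rather than return garbage.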

Q3. How fast is the firewall?

p95 = 1.13 ms in real CI, p50 = 2.35 ms, against a 4-layer scan (pattern + token + semantic + output). Public SLA: p95 ≤ 50 ms. A regression-gate ratchet (Ratchet 17) fails CI on degradation > 1.5× baseline.

Receipt: scripts/firewall-latency-ratchet.cjs + benchmarks/firewall-latency.md
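The regression gate can be sketched in a few lines. This is an illustration, not the logic of scripts/firewall-latency-ratchet.cjs: the function names and the nearest-rank percentile method are assumptions, and only the 1.5× tolerance comes from the text above.

```typescript
// Nearest-rank p95 over a list of latency samples in milliseconds.
export function p95(samplesMs: number[]): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Integer arithmetic avoids floating-point surprises in the rank.
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

// Passes only if the measured p95 stays within tolerance × the stored baseline.
export function checkLatencyRatchet(
  baselineP95Ms: number,
  samplesMs: number[],
  tolerance = 1.5,
): { current: number; limit: number; pass: boolean } {
  const current = p95(samplesMs);
  const limit = baselineP95Ms * tolerance;
  return { current, limit, pass: current <= limit };
}
```

With a baseline of 1.13 ms, this gate would allow a measured p95 of up to 1.695 ms before failing the build, which is far tighter than the public 50 ms SLA.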

Q4. What does your CI catch?

17 active ratchets. Hard zero: RLS coverage, force-dynamic, TODO/FIXME, `as any`, dynamic eval, mass-assignment. Hard validation: lint, security.txt freshness, counts-canonical match, --strict critical-path coverage, firewall latency. Lower-only: skip-count (silent: 13), OpenAPI coverage (0 missing), cross-tenant-eq (137 floor), coverage baseline.

Receipt: .github/workflows/ci.yml — Ratchets job
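As a rough illustration of the lower-only ratchets, the sketch below assumes "lower-only" means a tracked count (such as silent skips) may fall or hold but never rise above its stored baseline. The function and field names are hypothetical and are not taken from ci.yml.

```typescript
// Hypothetical lower-only ratchet: the count may shrink or hold, never grow.
interface RatchetResult {
  pass: boolean;
  newBaseline: number; // value to persist when the run passes
}

export function lowerOnlyRatchet(baseline: number, current: number): RatchetResult {
  if (current > baseline) {
    // Regression: fail and keep the old baseline.
    return { pass: false, newBaseline: baseline };
  }
  // Improvement or no change: tighten the baseline to the new value.
  return { pass: true, newBaseline: current };
}
```

Under this reading, a run with 14 silent skips against a baseline of 13 fails, while a run with 10 passes and lowers the baseline to 10 so the gain cannot be quietly given back.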

Q5. How do you prevent BOLA / IDOR?

Auth gate in createApiHandler middleware (97.1% line / 91.7% branch coverage). Cross-tenant `.eq("org_id")` on every owned-resource query (Layer 2 of ADR-0014). Static-analysis ratchet enforces zero regressions.

Receipt: apps/web/src/lib/api-handler.ts + ADR-0014

Q6. Audit log integrity?

Every audit row carries an entry_signature column: an HMAC-SHA256 of the row's content fields, keyed with AUDIT_SIGNING_KEY. A verifier (cron + on-demand /api/v1/audit-logs/verify) recomputes the signature and compares in constant time. The signed column list is versioned. Detects insider tampering even after a service-role-key compromise.

Receipt: ADR-0023 + apps/web/src/lib/audit-logger.ts (89.66% mutation score)
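A minimal sketch of sign-and-verify along these lines, using Node's crypto module. The column list, the `v1` version tag, and the JSON canonicalization are assumptions for illustration; apps/web/src/lib/audit-logger.ts may serialize rows differently.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed versioned column list; the real list lives in the audit logger.
const SIGNED_COLUMNS_V1 = ["id", "org_id", "actor", "action", "created_at"] as const;
type SignedColumn = (typeof SIGNED_COLUMNS_V1)[number];
type AuditRow = Record<SignedColumn, string> & { entry_signature?: string };

// HMAC-SHA256 over the content fields in a fixed order, prefixed with the version.
export function signAuditRow(row: AuditRow, key: string): string {
  const payload = JSON.stringify(["v1", ...SIGNED_COLUMNS_V1.map((c) => row[c])]);
  return createHmac("sha256", key).update(payload).digest("hex");
}

// Recomputes the signature and compares it in constant time.
export function verifyAuditRow(row: AuditRow, key: string): boolean {
  if (!row.entry_signature) return false;
  const expected = Buffer.from(signAuditRow(row, key), "hex");
  const actual = Buffer.from(row.entry_signature, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}
```

Because the signing key is separate from database credentials, someone who can write rows directly cannot produce a valid signature for an edited row, and the verifier reports the mismatch.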

Q7. How do you prevent SSRF in webhooks?

assertPublicUrl(url) helper. Blocklist: localhost, 127.0.0.1, 0.0.0.0, 169.254.169.254 (AWS IMDS), metadata.google.internal, metadata.azure.com. DNS resolution check for private IPv4 / IPv6 link-local + ULA. Protocol whitelist: http/https only. DNS-rebinding defense via re-resolution at fetch time.

Receipt: ADR-0015 + packages/core/src/security/ssrf-guard.ts
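The checks listed above can be sketched as follows. This is not the code in packages/core/src/security/ssrf-guard.ts: the function names and the exact private ranges are assumptions, and the fetch-time re-resolution that defends against DNS rebinding is omitted for brevity.

```typescript
import { BlockList, isIP } from "node:net";
import { lookup } from "node:dns/promises";

// Hostnames rejected before any DNS lookup (taken from the blocklist above).
const BLOCKED_HOSTS = new Set([
  "localhost", "127.0.0.1", "0.0.0.0", "169.254.169.254",
  "metadata.google.internal", "metadata.azure.com",
]);

// Assumed private, loopback, and link-local ranges.
const privateRanges = new BlockList();
privateRanges.addSubnet("0.0.0.0", 8, "ipv4");
privateRanges.addSubnet("10.0.0.0", 8, "ipv4");
privateRanges.addSubnet("127.0.0.0", 8, "ipv4");
privateRanges.addSubnet("169.254.0.0", 16, "ipv4");
privateRanges.addSubnet("172.16.0.0", 12, "ipv4");
privateRanges.addSubnet("192.168.0.0", 16, "ipv4");
privateRanges.addAddress("::1", "ipv6");
privateRanges.addSubnet("fe80::", 10, "ipv6"); // link-local
privateRanges.addSubnet("fc00::", 7, "ipv6"); // unique local (ULA)

export function isPrivateAddress(ip: string): boolean {
  const family = isIP(ip); // 4, 6, or 0 when the string is not an IP
  if (family === 0) return true; // fail closed on anything unparseable
  return privateRanges.check(ip, family === 4 ? "ipv4" : "ipv6");
}

// Hypothetical stand-in for assertPublicUrl: throws unless the URL is public http(s).
export async function assertPublicUrlSketch(raw: string): Promise<URL> {
  const url = new URL(raw);
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`blocked protocol: ${url.protocol}`);
  }
  // URL.hostname keeps the brackets around IPv6 literals; strip them.
  const host = url.hostname.replace(/^\[|\]$/g, "").toLowerCase();
  if (BLOCKED_HOSTS.has(host)) throw new Error(`blocked host: ${host}`);
  // IP literals are checked directly; names are resolved to every address.
  const addresses = isIP(host) ? [{ address: host }] : await lookup(host, { all: true });
  for (const { address } of addresses) {
    if (isPrivateAddress(address)) throw new Error(`private address: ${address}`);
  }
  return url;
}
```

Checking every resolved address, rather than only the first, matters because a hostname can return a mix of public and private records.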

Q8. Where do design decisions live?

/docs/adr/ — 34 numbered architecture decision records. Each captures status, date, tags, context (forcing function), decision with alternatives considered, consequences (what it makes easy/hard, review triggers, references). PR-reviewed.

Receipt: docs/adr/README.md — ADR-0001 through ADR-0034

What we don't yet have

Honest gaps. Each is being addressed; none are being hidden.

api-handler.ts mutation score: 44.29% — the load-bearing API middleware has high line + branch coverage (97.1% / 91.7%) but mutation testing reveals 278 surviving mutants, mostly StringLiteral mutations on log messages and switch-case branch labels. Multi-day work to close. Tracked in scripts/.mutation-score-baseline.json.
Firewall detection-QUALITY benchmark — we publish latency (1.13 ms p95) but not yet recall/precision against a public attack corpus (HarmBench / GarakAI). A fast firewall that misses attacks is worse than a slow one that catches them. Detection-quality benchmark is the next P4.3 work item.
SOC 2 Type 1 attestation — gap analysis + control-to-TSC mapping done (leaning Drata), in docs/soc2-starter-pack.md, and the evidence engine is live; the auditor engagement is gated on funding. Attestation target Q4 2026. The /security page does not claim any SOC 2 status until the auditor letter is signed.
External pentest — none commissioned yet. Planned post-Type-1 attestation using a HackerOne or Big-4 firm.
Lint warnings: 288 remaining — down from 825 (-65%) across 3 chip passes this session. Multi-session per-file work to reach zero. Tracked in P1.2 task.

Sources of truth

docs/defensibility-roadmap.md — the 14-criterion scoreboard, updated each round.

docs/adr/ — 34 ADRs covering BYOK encryption, cross-tenant defense, audit signing, ratchet discipline, etc.

.github/workflows/ci.yml — 17 active CI ratchets, each with deliberate-break verification (per ADR-0028).

benchmarks/ — public benchmarks with reproducible measurement scripts.

docs/soc2-starter-pack.md — SOC 2 Type 1 calendar, vendor comparison, control map.

Synthetic uptime probe history — public Actions runs, hourly probe of 3 production endpoints.

/blog — 12 engineering blog posts covering audit + ratchets + security postmortems.

Found a discrepancy between this page and the underlying receipt? security@evalguard.ai — we'd rather correct it than leave it.

On “world's best engineering”