Dataset Versioning
Immutable per-dataset snapshots. Lock in a frozen case set so re-running an experiment a month later hits the same inputs, bit-for-bit.
Available via REST, the evalguard datasets CLI, the Python SDK, and the dashboard Versions panel on /dashboard/datasets.
Why this matters
Eval results are only meaningful if you can reproduce them. The moment a teammate edits a case mid-experiment, the "baseline" you compared against last week no longer exists. Snapshots solve that — each one is append-only and hash-stable, so an experiment pinned to a version replays against the same cases for as long as the row exists.
What gets snapshotted
Every snapshot stores the FULL case set as an inline JSONB array on the dataset_versions row. Each case carries its id, input, expected_output, and metadata — exactly the columns dataset_cases exposes. Deleting a case from the live dataset later does not affect a version row that contains it.
Snapshots are deduplicated by content hash. Calling snapshot twice in a row when nothing has changed returns { unchanged: true } without writing a new row.
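The deduplication rule can be pictured with a small in-memory sketch. This is not EvalGuard's implementation: the hash function (SHA-256), the canonical JSON encoding, sorting cases by id, and the names content_hash and snapshot are all assumptions made for illustration.

```python
import copy
import hashlib
import json


def content_hash(cases):
    """Hash a canonical JSON encoding of the case set (assumed scheme)."""
    # Sort cases by id and sort object keys so that ordering alone
    # never changes the digest.
    ordered = sorted(cases, key=lambda case: case["id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot(versions, cases):
    """Append a new version unless the content matches the latest one."""
    digest = content_hash(cases)
    if versions and versions[-1]["content_hash"] == digest:
        # Nothing changed: report the existing version, write no new row.
        return {"unchanged": True, "version": versions[-1]}
    version = {
        "version_num": len(versions) + 1,
        "content_hash": digest,
        # Deep copy so later edits to the live cases cannot leak in.
        "cases": copy.deepcopy(cases),
    }
    versions.append(version)
    return {"unchanged": False, "version": version}
```

Under these assumptions, calling snapshot twice on an unedited case list yields one stored version, and editing the live list afterwards leaves the stored copy untouched.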
REST
POST /api/v1/datasets/<id>/versions
{
"description": "before adding adversarial cases"
}
# 200 OK — fresh snapshot
{ "data": { "unchanged": false, "version": { "id": "...", "version_num": 3, ... } } }
# 200 OK — content matches latest
{ "data": { "unchanged": true, "version": { "version_num": 2, ... }, "message": "..." } }
GET /api/v1/datasets/<id>/versions
GET /api/v1/datasets/<id>/versions/<versionId>
GET /api/v1/datasets/<id>/versions/<versionId>/diff?to=<otherVersionId>
POST /api/v1/datasets/<id>/versions/<versionId>/restore
# 200 OK
{ "data": { "restoredFromVersion": 2, "caseCount": 42, "preRestoreVersionNum": 6 } }
CLI
# List snapshots
evalguard datasets versions <datasetId>
# Snapshot the current state
evalguard datasets snapshot <datasetId> --description "v3 — adds adversarial cases"
# Fetch one version (omit --cases to skip the JSONB payload)
evalguard datasets get <datasetId> <versionId> --cases
# Restore (use --yes in CI/scripts to skip the confirm prompt)
evalguard datasets restore <datasetId> <versionId> --yes
# Diff two snapshots — added / removed / modified counts + sample changes
evalguard datasets diff <datasetId> <fromVersionId> <toVersionId> --json
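To make the added / removed / modified counts concrete, here is a minimal sketch of how two snapshots could be compared by case id. It assumes cases are matched on id and compared by full equality. The function name diff_versions, the sample_limit parameter, and the shape of the sampleChanges entries are hypothetical; the service's actual diff logic is not documented here.

```python
def diff_versions(from_cases, to_cases, sample_limit=3):
    """Compare two case lists by id and return counts plus a few samples."""
    before = {case["id"]: case for case in from_cases}
    after = {case["id"]: case for case in to_cases}

    added = [cid for cid in after if cid not in before]
    removed = [cid for cid in before if cid not in after]
    shared = [cid for cid in after if cid in before]
    modified = [cid for cid in shared if after[cid] != before[cid]]
    unchanged = [cid for cid in shared if after[cid] == before[cid]]

    # Hypothetical sample format: one small record per changed case.
    samples = (
        [{"id": cid, "change": "added"} for cid in added]
        + [{"id": cid, "change": "removed"} for cid in removed]
        + [{"id": cid, "change": "modified"} for cid in modified]
    )
    return {
        "added": len(added),
        "removed": len(removed),
        "modified": len(modified),
        "unchanged": len(unchanged),
        "sampleChanges": samples[:sample_limit],
    }
```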
Python SDK
from evalguard import EvalGuardClient

client = EvalGuardClient(api_key="eg_live_...")

# Snapshot
snap = client.snapshot_dataset("dataset-uuid", description="release candidate")
if snap["unchanged"]:
    print(f"No change since v{snap['version']['version_num']}")
else:
    print(f"Snapshotted as v{snap['version']['version_num']}")

# Compare (from_id and to_id are the ids of two existing versions)
diff = client.diff_dataset_versions("dataset-uuid", from_id, to_id)
print(diff["diff"])  # { added, removed, modified, unchanged, sampleChanges }

# Pin an experiment to a snapshot for reproducible re-runs
client.run_eval({
    "source": "dataset_version",
    "datasetId": "dataset-uuid",
    "datasetVersionId": from_id,
    "models": ["gpt-4o-mini"],
    "scorers": ["faithfulness"],
})
Reproducible experiments
POST /api/v1/experiments with source: "dataset_version" + datasetVersionId. The experiment's case set is loaded from the snapshot, not from the live dataset_cases table — so the re-run hits the same inputs even if the dataset has been edited since.
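For callers not using the SDK, the same pinned request could be built by hand. The sketch below only constructs the request and does not send it. It assumes the REST body mirrors the fields of the SDK's run_eval call and that the API key is sent as a Bearer token; neither is confirmed on this page. The helper name build_experiment_request and the base URL are hypothetical.

```python
import json
import urllib.request


def build_experiment_request(base_url, api_key, dataset_id, version_id):
    """Build (but do not send) a POST that pins an experiment to a snapshot."""
    payload = {
        "source": "dataset_version",     # load cases from the snapshot
        "datasetId": dataset_id,
        "datasetVersionId": version_id,  # the frozen version to replay
        "models": ["gpt-4o-mini"],
        "scorers": ["faithfulness"],
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v1/experiments",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumed auth scheme; check your deployment's API reference.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. Reusing the same datasetVersionId on a later run is what keeps the inputs identical.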
Authorization model
- SELECT — any org member of the dataset's project can list/get/diff versions.
- INSERT (snapshot, restore) — admin+ only. Enforced by RLS.
- UPDATE / DELETE — never. Both a database trigger and an RLS policy block them. Versions are immutable; the only way a row disappears is when its parent dataset is deleted (CASCADE).
- Audit — dataset.version.snapshot and dataset.version.restore log into the standard audit table.
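The rules above reduce to a small decision table. The sketch below restates them in Python purely as an illustration; the real enforcement lives in RLS policies and a database trigger. The role names and their ordering (member, admin, owner) are assumptions, since this page only says "org member" and "admin+".

```python
# Hypothetical role ladder; only "member" and "admin+" are implied by the docs.
ROLE_RANK = {"member": 0, "admin": 1, "owner": 2}


def is_allowed(role, operation):
    """Return True if a role may perform an operation on dataset versions."""
    if operation in ("update", "delete"):
        # Versions are immutable: nobody may update or delete them.
        return False
    if role not in ROLE_RANK:
        # Not a member of the dataset's project.
        return False
    if operation == "select":
        # Any project member can list, get, and diff versions.
        return True
    if operation == "insert":
        # Snapshot and restore both insert rows: admin or above only.
        return ROLE_RANK[role] >= ROLE_RANK["admin"]
    return False
```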
Vs the competition
- Promptfoo / DeepEval — no versioning at all. Re-runs are best-effort.
- LangSmith / Braintrust — snapshot + restore exist but as proprietary cloud-only features. EvalGuard ships them as first-class REST/CLI/SDK surfaces that work on self-hosted too.
- Pinning — most platforms let you tag a dataset; few let you pin an EXPERIMENT to a frozen version so the eval re-run is bit-perfect reproducible. We do.