Dataset Versioning

Immutable per-dataset snapshots. Lock in a frozen case set so re-running an experiment a month later hits the same inputs, bit-for-bit.

Available via REST, the evalguard datasets CLI, the Python SDK, and the dashboard Versions panel on /dashboard/datasets.

Why this matters

Eval results are only meaningful if you can reproduce them. The moment a teammate edits a case mid-experiment, the "baseline" you compared against last week no longer exists. Snapshots solve that — each one is append-only and hash-stable, so an experiment pinned to a version replays against the same cases for as long as the row exists.

What gets snapshotted

Every snapshot stores the FULL case set as an inline JSONB array on the dataset_versions row. Each case carries its id, input, expected_output, and metadata, exactly the columns dataset_cases exposes. Deleting a case from the live dataset later does not affect a version row that contains it.

Snapshots are deduplicated by content hash. Calling snapshot twice in a row when nothing has changed returns { unchanged: true } without writing a new row.
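
The deduplication rule above can be illustrated with a small in-memory sketch. It is not EvalGuard's implementation: the canonicalization (sorting cases by id, sorted JSON keys), the choice of SHA-256, and the content_hash field name are all assumptions made for illustration.

```python
import copy
import hashlib
import json


def content_hash(cases):
    """Hash a case set so that equal content always yields the same digest."""
    # Assumption: row order and key order should not affect the digest,
    # so cases are sorted by id and serialized with sorted keys.
    ordered = sorted(cases, key=lambda c: c["id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot(versions, cases, description=""):
    """Append a new version unless the content matches the latest one."""
    digest = content_hash(cases)
    if versions and versions[-1]["content_hash"] == digest:
        # Nothing changed since the latest snapshot: no new row is written.
        return {"unchanged": True, "version": versions[-1]}
    version = {
        "version_num": len(versions) + 1,
        "content_hash": digest,  # hypothetical field name
        "description": description,
        "cases": copy.deepcopy(cases),  # detached from later live edits
    }
    versions.append(version)
    return {"unchanged": False, "version": version}
```

Calling snapshot twice with the same cases appends one version and returns unchanged: True on the second call; editing or adding a case changes the digest and produces the next version_num.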

REST

Take a snapshot

POST /api/v1/datasets/<id>/versions

{
  "description": "before adding adversarial cases"
}

# 200 OK — fresh snapshot
{ "data": { "unchanged": false, "version": { "id": "...", "version_num": 3, ... } } }

# 200 OK — content matches latest
{ "data": { "unchanged": true,  "version": { "version_num": 2, ... }, "message": "..." } }

List + read

GET /api/v1/datasets/<id>/versions
GET /api/v1/datasets/<id>/versions/<versionId>
GET /api/v1/datasets/<id>/versions/<versionId>/diff?to=<otherVersionId>
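
To make the diff counts concrete, here is a minimal sketch of how added, removed, modified, and unchanged could be computed between two snapshots. It assumes cases are matched by id and compared by full equality; how the service actually matches cases, and the shape of sampleChanges, are not documented here, so sampleChanges is left out.

```python
def diff_cases(from_cases, to_cases):
    """Count added, removed, modified, and unchanged cases between two snapshots."""
    # Assumption: a case keeps the same id across versions.
    before = {c["id"]: c for c in from_cases}
    after = {c["id"]: c for c in to_cases}

    added = [cid for cid in after if cid not in before]
    removed = [cid for cid in before if cid not in after]
    shared = [cid for cid in before if cid in after]
    # A shared case counts as modified if any of its fields differ.
    modified = [cid for cid in shared if before[cid] != after[cid]]

    return {
        "added": len(added),
        "removed": len(removed),
        "modified": len(modified),
        "unchanged": len(shared) - len(modified),
    }
```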

Restore (auto-snapshots pre-restore state)

POST /api/v1/datasets/<id>/versions/<versionId>/restore

# 200 OK
{ "data": { "restoredFromVersion": 2, "caseCount": 42, "preRestoreVersionNum": 6 } }
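
The restore flow (snapshot the current state first, then overwrite the live cases) can be sketched in memory as follows. This is an illustration under stated assumptions, not the server's code: versions is a plain list of dicts, the auto-snapshot is always written (the real service may deduplicate it by content hash), and an unknown version number simply raises StopIteration.

```python
import copy


def restore(live_cases, versions, target_version_num):
    """Restore a snapshot into live_cases, preserving the current state first."""
    target = next(v for v in versions if v["version_num"] == target_version_num)

    # Step 1: record whatever is live right now as a new version,
    # so the restore itself can be undone.
    pre_restore = {
        "version_num": versions[-1]["version_num"] + 1,
        "description": f"auto-snapshot before restoring v{target_version_num}",
        "cases": copy.deepcopy(live_cases),
    }
    versions.append(pre_restore)

    # Step 2: replace the live cases in place with a copy of the target's cases.
    live_cases[:] = copy.deepcopy(target["cases"])

    return {
        "restoredFromVersion": target["version_num"],
        "caseCount": len(live_cases),
        "preRestoreVersionNum": pre_restore["version_num"],
    }
```

The existing version rows are never modified; restoring only appends a row and changes the live case set.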

CLI

evalguard datasets — bundled with the main CLI

# List snapshots
evalguard datasets versions <datasetId>

# Snapshot the current state
evalguard datasets snapshot <datasetId> --description "v3 — adds adversarial cases"

# Fetch one version (omit --cases to skip the JSONB payload)
evalguard datasets get <datasetId> <versionId> --cases

# Restore (use --yes in CI/scripts to skip the confirm prompt)
evalguard datasets restore <datasetId> <versionId> --yes

# Diff two snapshots — added / removed / modified counts + sample changes
evalguard datasets diff <datasetId> <fromVersionId> <toVersionId> --json

Python SDK

evalguard-sdk-python

from evalguard import EvalGuardClient

client = EvalGuardClient(api_key="eg_live_...")

# Snapshot
snap = client.snapshot_dataset("dataset-uuid", description="release candidate")
if snap["unchanged"]:
    print(f"No change since v{snap['version']['version_num']}")
else:
    print(f"Snapshotted as v{snap['version']['version_num']}")

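# Compare (from_id and to_id are version IDs, e.g. taken from the versions list)
from_id, to_id = "from-version-uuid", "to-version-uuid"  # placeholders
diff = client.diff_dataset_versions("dataset-uuid", from_id, to_id)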
print(diff["diff"])   # { added, removed, modified, unchanged, sampleChanges }

# Pin an experiment to a snapshot for reproducible re-runs
client.run_eval({
    "source": "dataset_version",
    "datasetId":  "dataset-uuid",
    "datasetVersionId": from_id,
    "models": ["gpt-4o-mini"],
    "scorers": ["faithfulness"],
})

Reproducible experiments

POST /api/v1/experiments with source: "dataset_version" + datasetVersionId. The experiment's case set is loaded from the snapshot, not from the live dataset_cases table, so the re-run hits the same inputs even if the dataset has been edited since.
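
A rough sketch of the case-resolution step described above, under assumptions: resolve_cases and versions_by_id are hypothetical names, and falling back to the live cases for any other source value is a guess about behavior this page does not document.

```python
def resolve_cases(request, live_cases, versions_by_id):
    """Pick the case set an experiment runs against."""
    if request.get("source") == "dataset_version":
        version_id = request.get("datasetVersionId")
        if version_id is None:
            raise ValueError(
                "datasetVersionId is required when source is 'dataset_version'"
            )
        # Pinned run: read the frozen cases stored on the version row.
        return versions_by_id[version_id]["cases"]
    # Assumption: any other source reads the live dataset.
    return live_cases
```

With a pinned request, later edits to the live cases have no effect on what the experiment sees, which is what makes the re-run reproducible.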

Authorization model

SELECT — any org member of the dataset's project can list/get/diff versions.
INSERT (snapshot, restore) — admin+ only. Enforced by RLS.
UPDATE / DELETE — never. Both a database trigger and an RLS policy block them. Versions are immutable; the only way a row disappears is when its parent dataset is deleted (CASCADE).
Audit — dataset.version.snapshot and dataset.version.restore log into the standard audit table.
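
The rules above can be summarized as a small permission check. This is only a model of the policy for reasoning about it; the real enforcement lives in RLS policies and a database trigger. The role names and their ordering (member, admin, owner) are assumptions standing in for "org member" and "admin+".

```python
# Hypothetical role names and ranks; "admin+" is read as admin or higher.
ROLE_RANK = {"member": 1, "admin": 2, "owner": 3}

READ_ACTIONS = ("list", "get", "diff")
WRITE_ACTIONS = ("snapshot", "restore")


def is_allowed(role, action):
    """Return True if a role may perform an action on dataset versions."""
    rank = ROLE_RANK.get(role, 0)  # unknown roles get no access
    if action in READ_ACTIONS:
        return rank >= ROLE_RANK["member"]
    if action in WRITE_ACTIONS:
        return rank >= ROLE_RANK["admin"]
    # Update and delete on version rows are blocked for every role.
    return False
```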

Vs the competition

Promptfoo / DeepEval — no versioning at all. Re-runs are best-effort.
LangSmith / Braintrust — snapshot + restore exist but as proprietary cloud-only features. EvalGuard ships them as first-class REST/CLI/SDK surfaces that work on self-hosted too.
Pinning — most platforms let you tag a dataset; few let you pin an EXPERIMENT to a frozen version so the eval re-run is bit-perfect reproducible. We do.