OpenAI

openai

YAML config

providers:
  - id: openai:gpt-4o-mini
    config:
      apiKey: ${OPENAI_API_KEY}

TypeScript usage

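import { createProvider } from "@evalguard/core";

// Fail fast if the key is missing instead of sending an unauthenticated request.
const apiKey = process.env.OPENAI_API_KEY;
if (!apiKey) throw new Error("OPENAI_API_KEY is not set");

const provider = createProvider("openai", apiKey);

// Top-level await requires an ES module context.
const response = await provider.complete({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Hello" }],
});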

Authentication

Set OPENAI_API_KEY in your environment. EvalGuard validates the key on first call and surfaces typed errors for 401 / 403 / rate-limit responses (with Retry-After parsing).
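
This page does not show how EvalGuard parses Retry-After, so the following is only a minimal sketch of what such parsing typically involves. It assumes the standard HTTP header forms (an integer number of seconds, or an HTTP-date). `parseRetryAfterMs` is a hypothetical helper name, not part of EvalGuard's API.

```typescript
// Hypothetical helper, not part of EvalGuard's documented API.
// Converts a Retry-After header value into a wait time in milliseconds.
// Returns null when the header is absent or cannot be interpreted.
function parseRetryAfterMs(
  header: string | null | undefined,
  nowMs: number = Date.now(),
): number | null {
  if (header == null) return null;
  const value = header.trim();
  if (value === "") return null;

  // Form 1: delay-seconds, a non-negative integer such as "20".
  if (/^\d+$/.test(value)) {
    return Number(value) * 1000;
  }

  // Form 2: an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return null;

  // A date in the past means the caller may retry immediately.
  return Math.max(0, dateMs - nowMs);
}
```

A caller would typically sleep for the returned duration before retrying, and fall back to its own backoff policy when the result is `null`.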

Setup walkthrough

1. Generate an API key at https://platform.openai.com/api-keys (or use your org's existing one).
2. Set the key in your environment: `export OPENAI_API_KEY=sk-...` for local; for prod, store via your secret manager (AWS Secrets Manager, GCP Secret Manager, Vault).
3. (Optional but recommended) Store the key in EvalGuard's BYOK vault via `/dashboard/api-keys` — the key never appears in logs, traces, or worker memory.
4. Configure the provider in your eval YAML: `providers: [{id: 'openai:gpt-4o-mini', config: {apiKey: '${OPENAI_API_KEY}'}}]`.
5. Smoke test: `evalguard run --provider openai:gpt-4o-mini --prompt 'Hello'` should return a response.

Gotchas

Tier 1 rate limits (the tier for a newly funded org) cap gpt-4o-mini at 500 requests per minute (RPM). Plan eval batch sizes accordingly, or upgrade your tier.
Vision models (gpt-4o, gpt-4o-mini) accept `image_url` parts but the URL must be publicly fetchable from OpenAI's infrastructure — base64 data: URIs work for private images.
`response_format: {type: 'json_object'}` requires the word 'json' in the system or user prompt or the API returns 400. Use `json_schema` for stricter structured-output enforcement.
Prompt caching applies automatically to prompts of 1,024 tokens or more, including the system prompt. Place dynamic content after the static prompt prefix to maximize cache hits.
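
Some of the gotchas above can be turned into small preflight checks. The sketch below covers request pacing under an RPM cap, the 'json' wording requirement for `json_object` mode, and static-first message ordering for prompt caching. All function and type names are hypothetical and not part of EvalGuard or the OpenAI SDK. The message type is simplified to string content, so it does not model `image_url` parts.

```typescript
// Hypothetical preflight helpers; none of these names come from EvalGuard or the OpenAI SDK.

// Simplified chat message: string content only (no image_url parts).
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Rate limits: minimum spacing between request starts that keeps a steady
// stream of requests at or under the given requests-per-minute cap.
function minIntervalMs(rpmLimit: number): number {
  if (rpmLimit <= 0) throw new RangeError("rpmLimit must be positive");
  return 60000 / rpmLimit;
}

// json_object mode: true when a system or user message mentions "json" in any casing.
function mentionsJson(messages: ChatMessage[]): boolean {
  return messages.some(
    (m) => (m.role === "system" || m.role === "user") && /json/i.test(m.content),
  );
}

// Prompt caching: keep the static prompt first so the leading tokens are
// identical across calls, and append the per-request content after it.
function buildMessages(staticSystemPrompt: string, dynamicInput: string): ChatMessage[] {
  return [
    { role: "system", content: staticSystemPrompt },
    { role: "user", content: dynamicInput },
  ];
}
```

At the Tier 1 cap of 500 RPM, `minIntervalMs(500)` works out to 120 ms between requests. Token-per-minute limits are not modeled here and may be the tighter constraint for long prompts.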

Cost note

gpt-4o-mini: $0.15/M input, $0.60/M output. gpt-4o: $2.50/M input, $10/M output. Prompt caching gives a 50% discount on cached prefix tokens. Track per-eval cost via `cost_ledger` (T4.2).
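
To make the arithmetic concrete, here is a rough per-call cost estimator using the prices quoted above. It is a sketch for back-of-envelope planning, not a replacement for `cost_ledger`. The names `PRICES` and `estimateCostUsd` are hypothetical, and the prices are hard-coded from this page and may go out of date.

```typescript
// Hypothetical estimator; prices are USD per million tokens, copied from the cost note.
const PRICES = {
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
  "gpt-4o": { input: 2.5, output: 10 },
} as const;

type PricedModel = keyof typeof PRICES;

// Cached prefix tokens are billed at 50% of the normal input price.
const CACHED_INPUT_DISCOUNT = 0.5;

function estimateCostUsd(
  model: PricedModel,
  inputTokens: number,
  outputTokens: number,
  cachedInputTokens: number = 0,
): number {
  if (inputTokens < 0 || outputTokens < 0 || cachedInputTokens < 0) {
    throw new RangeError("token counts must be non-negative");
  }
  if (cachedInputTokens > inputTokens) {
    throw new RangeError("cachedInputTokens cannot exceed inputTokens");
  }
  const price = PRICES[model];
  // Cached tokens count as a discounted fraction of a full-price input token.
  const billableInput =
    inputTokens - cachedInputTokens + cachedInputTokens * (1 - CACHED_INPUT_DISCOUNT);
  return (billableInput * price.input + outputTokens * price.output) / 1000000;
}
```

For example, a gpt-4o call with 2,000 input tokens (1,024 of them cached) and 500 output tokens comes to roughly $0.00872: 976 full-price plus 512 discounted-equivalent input tokens at $2.50/M, plus 500 output tokens at $10/M.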

Recommended models

Eval / judge: gpt-4o-mini — best price/quality for scorer-as-judge
Agent / tool-use: gpt-4o — function calling reliability + 128K context
Code: gpt-4o — best at multi-language code generation
Vision: gpt-4o — strong OCR + UI understanding

Hand-written · 2026-05-21