Scorers

87 built-in scorers across 12 categories for evaluating LLM outputs.

Using Scorers

Scorers are specified by name in your eval config file or SDK call.

eval-config.json
{
  "name": "my-eval",
  "model": "gpt-4o",
  "prompt": "Answer: {{input}}",
  "scorers": ["exact-match", "faithfulness", "toxicity", "cost"],
  "cases": [
    { "input": "What is 2+2?", "expectedOutput": "4" }
  ]
}

Text Matching

Deterministic string comparison scorers.

exact-match

Exact string equality between output and expected

equals

Equality check with optional normalization

contains

Check if output contains expected string

contains-any

Check if output contains any of the expected strings

contains-all

Check if output contains all expected strings

icontains

Case-insensitive contains

icontains-any

Case-insensitive contains any

icontains-all

Case-insensitive contains all

starts-with

Check if output starts with expected prefix

ends-with

Check if output ends with expected suffix

regex-match

Match output against a regular expression

levenshtein

Levenshtein edit distance between output and expected

word-count

Check output word count against min/max bounds

length-check

Check output character length against bounds
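Several text-matching scorers take options via scorerOptions (the mechanism shown in the Custom Scorers section). A hypothetical sketch stacking two of them; the option names values, min, and max are assumptions, not confirmed API:

eval-config.json
{
  "scorers": ["icontains-any", "word-count"],
  "scorerOptions": {
    "icontains-any": { "values": ["paris", "france"] },
    "word-count": { "min": 5, "max": 100 }
  },
  "cases": [
    { "input": "Where is the Eiffel Tower?", "expectedOutput": "Paris, France" }
  ]
}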

Semantic

String-similarity, embedding-based, and LLM-based semantic comparison.

similar

Fuzzy string similarity (cosine, Jaccard, etc.)

semantic-similarity

Cosine similarity between embedding vectors

embedding-distance

Distance between output and expected embeddings

select-best

LLM selects the best output from multiple candidates
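A sketch of an embedding-based check with a pass threshold. The threshold option name is an assumption; your scorer may use a different key:

eval-config.json
{
  "scorers": ["semantic-similarity"],
  "scorerOptions": {
    "semantic-similarity": { "threshold": 0.8 }
  },
  "cases": [
    { "input": "Explain photosynthesis briefly", "expectedOutput": "Plants convert sunlight into chemical energy." }
  ]
}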

LLM-Based

LLM-as-judge scorers for complex quality assessments.

llm-grader

Custom LLM grading with your own rubric

g-eval

G-Eval framework for arbitrary evaluation criteria

faithfulness

Does the output faithfully represent the source?

relevance

Is the output relevant to the input query?

answer-relevance

Relevance of the answer to the specific question

factuality

Are the claims in the output factually correct?

hallucination

Does the output contain hallucinated information?

summarization

Quality of a summary against its source document

classifier

Classify output into custom categories

coherence

Logical flow and coherence of the output

fluency

Grammatical fluency and naturalness

completeness

Does the output fully address the query?

conciseness

Is the output appropriately concise?

readability

Readability score (Flesch-Kincaid, etc.)

arena-g-eval

Arena-style pairwise G-Eval comparison

general-task-completion

General task completion assessment
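llm-grader scores against a rubric you write yourself. A hypothetical config; the rubric and model option names are assumptions:

eval-config.json
{
  "scorers": ["llm-grader"],
  "scorerOptions": {
    "llm-grader": {
      "model": "gpt-4o",
      "rubric": "Score 1.0 if the answer is factually correct and cites the source, 0.5 if correct but uncited, 0.0 otherwise."
    }
  }
}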

JSON & Structured

Validate structured output formats.

json-valid

Is the output valid JSON?

json-schema

Does the output match a JSON schema?

json-correctness

Semantic correctness of JSON output

contains-json

Does the output contain a JSON block?

contains-sql

Does the output contain SQL?

contains-xml

Does the output contain XML?

contains-html

Does the output contain HTML?

is-sql

Is the entire output valid SQL?

is-html

Is the entire output valid HTML?

is-valid-function-call

Is the output a valid function call?
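json-schema presumably validates the output against a schema you supply. A sketch, assuming a schema option (the schema body itself is standard JSON Schema):

eval-config.json
{
  "scorers": ["json-valid", "json-schema"],
  "scorerOptions": {
    "json-schema": {
      "schema": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }
}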

NLP Metrics

Traditional NLP evaluation metrics.

rouge-n

ROUGE-N overlap between output and reference

bleu

BLEU score for translation / generation quality

gleu

Google-BLEU (GLEU), a sentence-level BLEU variant

meteor

METEOR score with synonymy and stemming

perplexity

Language model perplexity of the output
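N-gram metrics compare the output against a reference in expectedOutput. A sketch, assuming rouge-n takes its n-gram order as an n option:

eval-config.json
{
  "scorers": ["rouge-n", "bleu"],
  "scorerOptions": {
    "rouge-n": { "n": 2 }
  },
  "cases": [
    { "input": "Summarize the article", "expectedOutput": "A reference summary to compare n-grams against." }
  ]
}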

MCP & Agentic

Scorers for multi-step agent and MCP tool use evaluation.

tool-correctness

Did the agent use the correct tools?

task-completion

Did the agent complete the assigned task?

mcp-task-completion

MCP-specific task completion metric

mcp-use

Correctness of MCP tool invocations

multi-turn-mcp-use

Multi-turn MCP tool use evaluation

goal-accuracy

Did the agent achieve the stated goal?

step-efficiency

How efficiently did the agent solve the task?

plan-adherence

Did the agent follow the expected plan?

plan-quality

Quality of the agent's planning

argument-correctness

Correctness of function call arguments

dag-evaluation

DAG-based evaluation of multi-step workflows
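tool-correctness needs to know which tools the agent was expected to call. One plausible per-case shape; the expectedTools key and the tool name are assumptions:

eval-config.json
{
  "scorers": ["tool-correctness", "step-efficiency"],
  "cases": [
    {
      "input": "What's the weather in Paris?",
      "expectedTools": ["get_weather"]
    }
  ]
}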

Conversation

Multi-turn conversation evaluation.

conversation

Overall conversation quality assessment

conversation-relevance

Turn-by-turn relevance in conversations

conversation-completeness

Did the conversation address all topics?

knowledge-retention

Does the model retain context across turns?

role-adherence

Does the model stay in its assigned role?

role-violation

Did the model break its role constraints?
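Conversation scorers operate on multi-turn transcripts rather than single input/output pairs, so cases presumably carry a message list. A sketch; the messages case shape is an assumption:

eval-config.json
{
  "scorers": ["knowledge-retention", "role-adherence"],
  "cases": [
    {
      "messages": [
        { "role": "user", "content": "My name is Ada." },
        { "role": "assistant", "content": "Nice to meet you, Ada." },
        { "role": "user", "content": "What is my name?" }
      ]
    }
  ]
}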

RAG

Retrieval-augmented generation quality metrics.

context-faithfulness

Is the output faithful to retrieved context?

context-relevance

Is the retrieved context relevant to the query?

context-recall

Did the retriever find all relevant passages?

context-precision

Are retrieved passages precise and non-redundant?
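RAG scorers need the retrieved passages alongside the query and answer. A sketch, assuming a per-case context field holds the retrieved chunks:

eval-config.json
{
  "scorers": ["context-faithfulness", "context-recall"],
  "cases": [
    {
      "input": "When was the Eiffel Tower completed?",
      "context": ["The Eiffel Tower was completed in 1889."],
      "expectedOutput": "1889"
    }
  ]
}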

Safety

Safety and content moderation scorers.

toxicity

Toxicity level of the output

bias

Bias detection in the output

non-advice

Does the output avoid giving dangerous advice?

misuse

Could the output enable misuse?

is-refusal

Did the model appropriately refuse a harmful request?

Multimodal

Image and multimodal output evaluation.

text-to-image

Quality of text-to-image generation

image-coherence

Visual coherence of generated images

image-helpfulness

Helpfulness of image-based outputs

image-reference

Accuracy against a reference image

image-editing

Quality of image editing operations

Performance

Cost, latency, and observability metrics.

cost

Token cost of the LLM call

latency

Response latency in milliseconds

trace-span-count

Number of spans in a trace

trace-span-duration

Duration of trace spans

trace-error-spans

Count of error spans in a trace
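cost and latency work well as budget guards alongside quality scorers. A hypothetical config with pass/fail thresholds; the maxUsd and maxMs option names are assumptions:

eval-config.json
{
  "scorers": ["exact-match", "cost", "latency"],
  "scorerOptions": {
    "cost": { "maxUsd": 0.01 },
    "latency": { "maxMs": 2000 }
  }
}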

Custom

Bring your own scoring logic.

custom-function

Run a custom JavaScript/TypeScript function

webhook

Call an external webhook for scoring

Custom Scorers

Create custom scorers with the custom-function scorer (inline JavaScript/TypeScript) or the webhook scorer (an external HTTP endpoint).

Custom Function

eval-config.json
{
  "scorers": ["custom-function"],
  "scorerOptions": {
    "custom-function": {
      "function": "return output.includes('Paris') ? 1.0 : 0.0"
    }
  }
}

Webhook Scorer

eval-config.json
{
  "scorers": ["webhook"],
  "scorerOptions": {
    "webhook": {
      "url": "https://your-server.com/score",
      "headers": { "Authorization": "Bearer your-token" }
    }
  }
}
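The webhook scorer presumably POSTs each case to the configured URL and reads a score from the response. One plausible contract, shown here as an assumption rather than the documented payload:

Request body (assumed shape)
{
  "input": "What is 2+2?",
  "output": "4",
  "expectedOutput": "4"
}

Response body (assumed shape)
{
  "score": 1.0,
  "reason": "Exact match"
}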