# Eval Cases
Eval cases are individual test cases within an evaluation file. Each case defines input messages, expected outcomes, and optional evaluator overrides.
## Basic Structure
```yaml
evalcases:
  - id: addition
    expected_outcome: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"
```
## Fields

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier for the eval case |
| `expected_outcome` | Yes | Description of what a correct response should contain |
| `input` | Yes | Input sent to the target (string, object, or message array). Alias: `input_messages` |
| `expected_output` | No | Expected response for comparison (string, object, or message array). Alias: `expected_messages` |
| `execution` | No | Per-case execution overrides (target, evaluators) |
| `rubrics` | No | Structured evaluation criteria |
| `sidecar` | No | Additional metadata passed to evaluators |
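For orientation, the sketch below combines several of these fields in one case. The id and values are illustrative only, not taken from a real suite:

```yaml
evalcases:
  - id: capital-lookup            # illustrative id
    expected_outcome: Names Paris as the capital of France
    input: What is the capital of France?
    expected_output: "Paris"
    sidecar:
      topic: geography            # arbitrary metadata passed through to evaluators
```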
## Input

The simplest form of `input` is a string, which expands to a single user message:
```yaml
input: What is 15 + 27?
```

For multi-turn or system messages, use a message array:
```yaml
input:
  - role: system
    content: You are a helpful math tutor.
  - role: user
    content: What is 15 + 27?
```
## Expected Output

An optional reference response that evaluators use for comparison. A string expands to a single assistant message:
expected_output: "42"For structured or multi-message expected output, use a message array:
```yaml
expected_output:
  - role: assistant
    content: "42"
```
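The array form also allows more than one message. How longer references are used depends on your evaluators; a purely hypothetical multi-turn reference might look like:

```yaml
# Hypothetical multi-turn reference; messages use the same role/content format as input.
expected_output:
  - role: assistant
    content: "42"
  - role: user
    content: And 15 + 28?
  - role: assistant
    content: "43"
```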
## Per-Case Execution Overrides

Override the default target or evaluators for specific cases:
```yaml
evalcases:
  - id: complex-case
    expected_outcome: Provides detailed explanation
    input: Explain quicksort algorithm
    execution:
      target: gpt4_target
      evaluators:
        - name: depth_check
          type: llm_judge
          prompt: ./judges/depth.md
```
## File References

Include external files in message content using array format:
```yaml
input:
  - role: user
    content:
      - type: text
        value: Review this code against our guidelines.
      - type: file
        value: ./guidelines.md
```

Supported file path formats:
| Format | Resolution |
|---|---|
| `./path.md` | Relative to the eval file's directory |
| `../dir/file.md` | Relative path with parent traversal |
| `/docs/file.md` | Absolute from the repository root |
| `https://github.com/...` | GitHub blob URL (cloned and cached) |
| `https://gitlab.com/...` | GitLab blob URL (cloned and cached) |
| `https://bitbucket.org/...` | Bitbucket src URL (cloned and cached) |
Git URLs are cloned once and cached in `~/.agentv/cache/repos/`.
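A remote file is referenced the same way as a local one. As a sketch, with an illustrative repository URL:

```yaml
input:
  - role: user
    content:
      - type: text
        value: Summarize the contributing guide.
      - type: file
        value: https://github.com/example/repo/blob/main/CONTRIBUTING.md  # illustrative URL
```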
## Sidecar Metadata
Pass additional context to evaluators via the `sidecar` field:
```yaml
evalcases:
  - id: code-gen
    expected_outcome: Generates valid Python
    sidecar:
      language: python
      difficulty: medium
    input: Write a function to sort a list
```