# Eval Cases
Eval cases are individual test cases within an evaluation file. Each case defines input messages, expected outcomes, and optional evaluator overrides.
## Basic Structure
```yaml
evalcases:
  - id: addition
    expected_outcome: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"
```
## Fields

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier for the eval case |
| `expected_outcome` | Yes | Description of what a correct response should contain |
| `input` | Yes | Input sent to the target (string, object, or message array). Alias: `input_messages` |
| `expected_output` | No | Expected response for comparison (string, object, or message array). Alias: `expected_messages` |
| `execution` | No | Per-case execution overrides (target, evaluators) |
| `rubrics` | No | Structured evaluation criteria |
| `sidecar` | No | Additional metadata passed to evaluators |
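For orientation, the sketch below combines several of these fields in one case. The id and values are illustrative only, not taken from a real suite:

```yaml
evalcases:
  - id: capital-lookup            # illustrative id
    expected_outcome: Names Paris as the capital of France
    input: What is the capital of France?
    expected_output: "Paris"
    sidecar:
      topic: geography            # arbitrary metadata passed through to evaluators
```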
## Input

The simplest form of `input` is a string, which expands to a single user message:
```yaml
input: What is 15 + 27?
```

For multi-turn or system messages, use a message array:
```yaml
input:
  - role: system
    content: You are a helpful math tutor.
  - role: user
    content: What is 15 + 27?
```
## Expected Output

An optional reference response that evaluators use for comparison. A string expands to a single assistant message:
expected_output: "42"For structured or multi-message expected output, use a message array:
```yaml
expected_output:
  - role: assistant
    content: "42"
```
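The array form also allows more than one message. How longer references are used depends on your evaluators; a purely hypothetical multi-turn reference might look like:

```yaml
# Hypothetical multi-turn reference; messages use the same role/content format as input.
expected_output:
  - role: assistant
    content: "42"
  - role: user
    content: And 15 + 28?
  - role: assistant
    content: "43"
```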
## Per-Case Execution Overrides

Override the default target or evaluators for specific cases:
```yaml
evalcases:
  - id: complex-case
    expected_outcome: Provides detailed explanation
    input: Explain quicksort algorithm
    execution:
      target: gpt4_target
      evaluators:
        - name: depth_check
          type: llm_judge
          prompt: ./judges/depth.md
```
## File References

Include external files in message content using array format:
```yaml
input:
  - role: user
    content:
      - type: text
        value: Review this code against our guidelines.
      - type: file
        value: ./guidelines.md
```

Supported file path formats:
| Format | Resolution |
|---|---|
| `./path.md` | Relative to the eval file's directory |
| `../dir/file.md` | Relative path with parent traversal |
| `/docs/file.md` | Absolute from the repository root |
| `https://github.com/...` | GitHub blob URL (cloned and cached) |
| `https://gitlab.com/...` | GitLab blob URL (cloned and cached) |
| `https://bitbucket.org/...` | Bitbucket src URL (cloned and cached) |
Git URLs are cloned once and cached in `~/.agentv/cache/repos/`.
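A remote file is referenced the same way as a local one. As a sketch, with an illustrative repository URL:

```yaml
input:
  - role: user
    content:
      - type: text
        value: Summarize the contributing guide.
      - type: file
        value: https://github.com/example/repo/blob/main/CONTRIBUTING.md  # illustrative URL
```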
## Sidecar Metadata
Pass additional context to evaluators via the `sidecar` field:
```yaml
evalcases:
  - id: code-gen
    expected_outcome: Generates valid Python
    sidecar:
      language: python
      difficulty: medium
    input: Write a function to sort a list
```