eval

doraval eval reads an actual session transcript from your coding agent (Claude Code, etc.), identifies which skills were used, and runs an LLM judge to produce a structured PASS / FAIL verdict per skill.

It answers the question that validate and drift cannot: Did the agent actually follow the skill’s instructions in a real session?

Usage

doraval eval [options]
doraval eval --session <path>
doraval eval --skill <name-or-path>

The most common way to use it is with no arguments:

dora eval

doraval finds recent sessions for the current directory that used skills and lets you pick one or more interactively.

Interactive session selection

When you run dora eval without --session, it shows recent sessions that contain skill invocations:

  Recent sessions for this directory:
    1. 2026-06-20 "Summarize repo using CLI developer" (1 skill)  e5d77e7c-....jsonl
    2. 2026-06-20 "Explore improvement options" (1 skill)  41be1f09-....jsonl

  Select session(s) (e.g. 1,3 or 2-4 or all or latest):

You can select:

A number (2)
A comma-separated list (1,3)
A range (1-3)
all
latest
Or paste a direct path to a .jsonl file
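The selection grammar above can be sketched as a small parser. This is an illustrative sketch, not doraval's actual implementation: the function name `parse_selection` is hypothetical, and it assumes the list is ordered newest first, so that `latest` maps to entry 1.

```python
from typing import List, Union


def parse_selection(text: str, count: int) -> Union[List[int], str]:
    """Turn a selection string into 1-based session indices.

    `count` is the number of sessions shown in the list. A pasted
    .jsonl path is returned unchanged as a string.
    """
    choice = text.strip()
    if choice.endswith(".jsonl"):
        return choice  # a direct path bypasses the numbered list
    if choice == "all":
        return list(range(1, count + 1))
    if choice == "latest":
        return [1]  # assumption: entry 1 is the most recent session

    picked: List[int] = []
    for part in choice.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
        else:
            start = end = int(part)
        if not 1 <= start <= end <= count:
            raise ValueError(f"selection out of range: {part!r}")
        # Keep first-seen order and skip duplicates such as "1,1-2".
        picked.extend(i for i in range(start, end + 1) if i not in picked)
    return picked
```

Under these assumptions, `parse_selection("1,3", 3)` selects the first and third sessions, and anything that is neither a keyword, a number, nor a range raises `ValueError`.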

Targeting a specific skill

dora eval --skill improve
dora judge ./skills/improve/          # convenient alias
dora skill judge ./skills/improve/

Example output

  doraval eval: Session skill adherence

  Session: /Users/.../41be1f09-....jsonl
  · Sending session summary (tool calls + 5 user messages) to claude-sonnet-4-6. Use --verbose to inspect.

  Evaluating: improve
  → claude -p "{{prompt}}" --output-format json...

  [FAIL] improve
  agent:       claude-code
  model:       claude-sonnet-4-6
  familiarity: 2/10  (User message is extremely brief and vague ('i want to improve; give me few options'), relying entirely on the agent to determine what to do.)
  closure:     incomplete (1 turns)
  session:     41be1f09  "Explore improvement options"

  Adherence:
  ✓ Invoke the improve skill before responding  Skill tool called first
  ✓ Phase 1: Read README, AGENTS.md, config files, directory structure  Read agents.md, package.json, hooks.json, migration doc, directory listings
  ✗ Phase 1: Run git log for churn signal  No git log command observed
  ✗ Phase 1: Read references/audit-playbook.md  Audit playbook was never read
  ✗ Phase 2: Fan out parallel audit subagents across categories  No subagents launched; audit phase not reached
  ✗ Phase 3: Vet findings and present findings table to user  No findings table presented
  ✗ Phase 3: Ask user which findings to turn into plans  Session ended before this step
  ✗ Phase 4: Write plans only after user selection  Phase 4 never reached
  ✓ Never modify source code or run mutating commands  Only read-only operations observed

  Result: 3/9  [FAIL: Agent completed Phase 1 recon but never launched parallel audit subagents, read the audit playbook, presented a findings table, or reached Phases 2-4; stopped mid-recon.]

Understanding the output

Field	Meaning
`familiarity`	1–10 score inferred by the judge from the user’s actual messages. Low = vague prompts, high = precise technical instructions.
`closure`	`1-shot`, `multi-turn`, or `incomplete`. Based on how many user turns occurred after the first skill invocation and whether the session reached a natural end.
`Adherence`	Checklist dynamically extracted by the LLM from the skill’s instructions, then checked against the actual tool calls in the transcript.
Verdict	`PASS`, `FAIL`, or `UNKNOWN` (when the judge call failed or the session had no usable data).

The checklist is not based on static expected-eval frontmatter; it is generated by feeding the full SKILL.md content + the observed tool call sequence to the judge model.

This means the more clearly and specifically a skill describes the steps the agent should take, the more precise and actionable the eval becomes.
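To make the relationship between the adherence checklist and the `Result: 3/9` line concrete, here is a minimal sketch. The `ChecklistItem` type and `summarize` function are hypothetical names chosen for illustration; doraval's internal data model may differ, and the sketch does not model how the judge decides between PASS and FAIL.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChecklistItem:
    step: str      # instruction the judge extracted from SKILL.md
    passed: bool   # whether the transcript shows the step happened
    evidence: str  # the judge's one-line justification


def summarize(items: List[ChecklistItem]) -> str:
    """Return the 'passed/total' score shown on the Result line."""
    passed = sum(1 for item in items if item.passed)
    return f"{passed}/{len(items)}"


# The nine checklist rows from the example output above.
items = [
    ChecklistItem("Invoke the improve skill before responding", True, "Skill tool called first"),
    ChecklistItem("Phase 1: Read README, AGENTS.md, config files", True, "Read agents.md, package.json"),
    ChecklistItem("Phase 1: Run git log for churn signal", False, "No git log command observed"),
    ChecklistItem("Phase 1: Read references/audit-playbook.md", False, "Audit playbook was never read"),
    ChecklistItem("Phase 2: Fan out parallel audit subagents", False, "No subagents launched"),
    ChecklistItem("Phase 3: Vet findings and present findings table", False, "No findings table presented"),
    ChecklistItem("Phase 3: Ask user which findings to turn into plans", False, "Session ended before this step"),
    ChecklistItem("Phase 4: Write plans only after user selection", False, "Phase 4 never reached"),
    ChecklistItem("Never modify source code or run mutating commands", True, "Only read-only operations observed"),
]

print(summarize(items))
```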

`doraval eval history`

View previous eval results stored on disk:

dora eval history
dora eval history --limit 50
dora eval history --skill improve

Results are saved (by default) under ~/.doraval/evals/.

How `doraval eval` works

1. Locates a session transcript (via --session or by asking the active coding agent’s adapter; Claude Code is best supported today).
2. Parses the JSONL to extract:
   - Which skills were invoked (via Skill tool calls)
   - The ordered tool call sequence
   - User messages (first 5 used for familiarity inference)
   - Session metadata
3. For each skill, loads the local SKILL.md when possible.
4. Sends a compact summary (tool calls + user messages) to an LLM.
   - Default: your configured coding agent CLI.
   - If eval.model and an API key are both set: calls the model directly over an OpenAI-compatible API, with no gateway needed. GLM models default to the Z.AI (Zhipu) endpoint.
5. Parses the structured verdict + checklist.
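The parsing step can be sketched as follows. This is a simplified illustration, not doraval's parser. The record shape is an assumption: each JSONL line is taken to be an object with a `type` of `user` or `assistant` and a `message.content` that is either a string or a list of blocks, where tool calls are blocks of type `tool_use` with a `name` and an `input`. The `skill` key on the Skill tool's input is likewise assumed.

```python
import json
from typing import Any, Dict, Iterable


def extract_session(lines: Iterable[str], max_user_messages: int = 5) -> Dict[str, Any]:
    """Collect skills, ordered tool calls, and early user messages."""
    skills, tool_calls, user_messages = [], [], []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the transcript
        record = json.loads(raw)
        content = (record.get("message") or {}).get("content")

        if record.get("type") == "user" and isinstance(content, str):
            if len(user_messages) < max_user_messages:
                user_messages.append(content)
        elif record.get("type") == "assistant" and isinstance(content, list):
            for block in content:
                if block.get("type") != "tool_use":
                    continue
                name = block.get("name")
                args = block.get("input") or {}
                tool_calls.append((name, args))
                # Assumed shape: the Skill tool names its skill under "skill".
                skill = args.get("skill")
                if name == "Skill" and skill and skill not in skills:
                    skills.append(skill)

    return {"skills": skills, "tool_calls": tool_calls, "user_messages": user_messages}
```

The function accepts any iterable of lines, so it can be fed an open file handle directly: `extract_session(open(path, encoding="utf-8"))`.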

Configuration

eval requires an agent to be configured:

dora init

During init you will be asked for the model to use for judging. You can also set it later:

dora config set eval.model claude-sonnet-4-6
dora config get eval.model

The judge can use your coding agent CLI, or call an LLM directly via API (no proxy or gateway server required).

For cheap dev evals:

doraval config set eval.model glm-5-turbo
Set ZAI_API_KEY (or ZHIPU_API_KEY) in your environment

We default glm* models to the general Z.AI endpoint (https://api.z.ai/api/paas/v4).

You don’t need to tell doraval which plan you’re on.

To use a different endpoint (e.g. Coding Plan, self-hosted, LiteLLM proxy, etc.), set one of these:

OPENAI_BASE_URL=https://api.z.ai/api/coding/paas/v4
# or
ZAI_BASE_URL=https://api.z.ai/api/coding/paas/v4

You can also set it permanently:

doraval config set eval.base_url https://api.z.ai/api/coding/paas/v4

glm-5-turbo is recommended for the judge (fast + cheap). See https://docs.z.ai for current model names. Direct API mode works with any OpenAI-compatible provider, and doraval falls back to your agent CLI if no API key is available.
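The configuration rules in this section can be summarized in a short sketch. The function name `resolve_judge` and the returned dictionary are hypothetical, and the precedence order (config value, then `OPENAI_BASE_URL`, then `ZAI_BASE_URL`) is an assumption; the documentation above does not state which setting wins when several are present. Only the two API key variables named above are checked.

```python
from typing import Dict, Mapping, Optional

ZAI_DEFAULT_BASE_URL = "https://api.z.ai/api/paas/v4"


def resolve_judge(model: Optional[str],
                  config: Mapping[str, str],
                  env: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Decide whether to call an API directly or use the agent CLI."""
    api_key = env.get("ZAI_API_KEY") or env.get("ZHIPU_API_KEY")
    if not model or not api_key:
        # No model or no key: fall back to the configured coding agent CLI.
        return {"mode": "agent-cli"}

    # Assumed precedence: explicit config, then environment overrides.
    base_url = (config.get("eval.base_url")
                or env.get("OPENAI_BASE_URL")
                or env.get("ZAI_BASE_URL"))
    if base_url is None and model.startswith("glm"):
        base_url = ZAI_DEFAULT_BASE_URL  # general Z.AI endpoint for glm* models

    return {"mode": "api", "model": model, "base_url": base_url}
```

In real use, `env` would be `os.environ` and `config` whatever `doraval config get` reads from; both are passed in here so the logic can be exercised without touching the process environment.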

Options

Flag	Short	Description
`--session <path>`		Path to one or more `.jsonl` session files (comma or space separated)
`--skill <name>`		Filter evaluation to a specific skill
`--format <type>`	`-f`	`table` (default) or `json`
`--verbose`	`-v`	Show the “sending summary” notice and full details
`--ci`		Exit with code 1 if any verdict is `FAIL`
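For scripted use, the flags in the table can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of doraval; it uses only the flags listed above and joins multiple session paths with commas, one of the two separators the table allows.

```python
from typing import List, Optional, Sequence


def build_eval_argv(sessions: Sequence[str] = (),
                    skill: Optional[str] = None,
                    fmt: str = "table",
                    ci: bool = False,
                    verbose: bool = False) -> List[str]:
    """Build an argv list for `doraval eval` from documented flags."""
    if fmt not in ("table", "json"):
        raise ValueError(f"unsupported format: {fmt!r}")

    argv = ["doraval", "eval"]
    if sessions:
        argv += ["--session", ",".join(sessions)]
    if skill:
        argv += ["--skill", skill]
    if fmt != "table":  # table is the default, so only pass json explicitly
        argv += ["--format", fmt]
    if verbose:
        argv.append("--verbose")
    if ci:
        argv.append("--ci")
    return argv
```

The resulting list can be handed to `subprocess.run(argv)` without shell quoting concerns.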

Aliases

These are equivalent:

dora eval --skill improve
dora judge improve
dora skill judge ./skills/improve/

judge is a thin wrapper that sets --skill and delegates to eval.

Privacy note

Before calling the judge, doraval prints:

· Sending session summary (tool calls + 5 user messages) to <model>.

Only tool names and truncated inputs (first ~100 chars) plus the first few user messages are sent. Full file contents and command output are not sent.
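A minimal sketch of that reduction, assuming tool inputs are JSON-serializable dictionaries. The function name `build_summary` and the output shape are hypothetical; the 100-character and 5-message limits mirror the approximate figures quoted above rather than confirmed constants.

```python
import json
from typing import Any, Dict, List, Sequence, Tuple


def build_summary(tool_calls: Sequence[Tuple[str, Dict[str, Any]]],
                  user_messages: Sequence[str],
                  max_input_chars: int = 100,
                  max_messages: int = 5) -> Dict[str, List[Any]]:
    """Reduce a parsed session to tool names, clipped inputs, and early messages."""
    calls = []
    for name, args in tool_calls:
        serialized = json.dumps(args, sort_keys=True)
        # Keep only the head of each input; tool results are never included.
        calls.append({"tool": name, "input": serialized[:max_input_chars]})
    return {
        "tool_calls": calls,
        "user_messages": list(user_messages[:max_messages]),
    }
```

Because the truncation is applied to the serialized input, a long file written through a tool call would be clipped along with everything else in that call's arguments.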

Exit codes

0: success (or informational run)
1: with --ci, at least one skill received FAIL
2: fatal (no agent configured, no session found, etc.)
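The mapping from verdicts to exit codes, as described in the list above, can be written out as a small function. The name `exit_code` is hypothetical. Reading the list literally, an `UNKNOWN` verdict does not trip `--ci`; that interpretation is an assumption worth verifying if your pipeline depends on it.

```python
from typing import Iterable


def exit_code(verdicts: Iterable[str], ci: bool = False, fatal: bool = False) -> int:
    """Map a run's outcome to the documented process exit code."""
    if fatal:
        return 2  # no agent configured, no session found, etc.
    if ci and "FAIL" in list(verdicts):
        return 1  # only --ci turns a FAIL verdict into a failing exit
    return 0      # success, or an informational run without --ci
```

Without `--ci`, a run with failing skills still exits 0, so CI jobs need the flag for the verdict to gate a build.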

When to use eval

After a real agent session that used one of your skills.
To measure whether a skill is actually effective in practice (as opposed to just being structurally valid).
As part of a feedback loop when authoring or refining high-leverage skills.
In CI for important skills, by running with --ci against a recent golden session.

validate + drift are fast and deterministic. eval is slower and non-deterministic but gives the highest-signal feedback about real-world adherence.