eval

doraval eval reads an actual session transcript from your coding agent (Claude Code, etc.), identifies which skills were used, and runs an LLM judge to produce a structured PASS / FAIL verdict per skill.

It answers the question that validate and drift cannot: Did the agent actually follow the skill’s instructions in a real session?

Terminal window
doraval eval [options]
doraval eval --session <path>
doraval eval --skill <name-or-path>

The most common way to use it is with no arguments:

Terminal window
dora eval

doraval will find recent sessions in the current directory that used skills and let you pick interactively.

When you run dora eval without --session, it shows recent sessions that contain skill invocations:

Recent sessions for this directory:
1. 2026-06-20 "Summarize repo using CLI developer" (1 skill) e5d77e7c-....jsonl
2. 2026-06-20 "Explore improvement options" (1 skill) 41be1f09-....jsonl
Select session(s) (e.g. 1,3 or 2-4 or all or latest):

You can select:

  • A number (2)
  • A range (1-3)
  • all
  • latest
  • Or paste a direct path to a .jsonl file
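
The selection grammar above can be sketched as a small parser. This is a minimal illustration, not doraval's actual implementation: the function name `parse_selection` is hypothetical, direct `.jsonl` paths are not handled, and it assumes the list is shown newest first so that `latest` maps to entry 1.

```python
def parse_selection(text: str, count: int) -> list[int]:
    """Return sorted 1-based indices chosen from a list of `count` sessions."""
    if count < 1:
        raise ValueError("no sessions to select from")
    text = text.strip().lower()
    if text == "all":
        return list(range(1, count + 1))
    if text == "latest":
        return [1]  # assumption: the list is ordered newest first
    chosen: set[int] = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as "2-4" selects both endpoints and everything between.
            lo, hi = (int(piece) for piece in part.split("-", 1))
        else:
            lo = hi = int(part)
        if not 1 <= lo <= hi <= count:
            raise ValueError(f"selection out of range: {part!r}")
        chosen.update(range(lo, hi + 1))
    return sorted(chosen)
```

For example, `parse_selection("1,3", 4)` yields entries 1 and 3, and `parse_selection("2-4", 4)` yields entries 2 through 4.
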

To evaluate a single skill, pass --skill or use the judge alias:

Terminal window
dora eval --skill improve
dora judge ./skills/improve/ # convenient alias
dora skill judge ./skills/improve/

Example output:

Terminal window
doraval eval: Session skill adherence
Session: /Users/.../41be1f09-....jsonl
· Sending session summary (tool calls + 5 user messages) to claude-sonnet-4-6. Use --verbose to inspect.
Evaluating: improve
claude -p "{{prompt}}" --output-format json...
[FAIL] improve
agent: claude-code
model: claude-sonnet-4-6
familiarity: 2/10 (User message is extremely brief and vague ('i want to improve; give me few options'), relying entirely on the agent to determine what to do.)
closure: incomplete (1 turns)
session: 41be1f09 "Explore improvement options"
Adherence:
Invoke the improve skill before responding Skill tool called first
Phase 1: Read README, AGENTS.md, config files, directory structure Read agents.md, package.json, hooks.json, migration doc, directory listings
Phase 1: Run git log for churn signal No git log command observed
Phase 1: Read references/audit-playbook.md Audit playbook was never read
Phase 2: Fan out parallel audit subagents across categories No subagents launched; audit phase not reached
Phase 3: Vet findings and present findings table to user No findings table presented
Phase 3: Ask user which findings to turn into plans Session ended before this step
Phase 4: Write plans only after user selection Phase 4 never reached
Never modify source code or run mutating commands Only read-only operations observed
Result: 3/9 [FAIL: Agent completed Phase 1 recon but never launched parallel audit subagents, read the audit playbook, presented a findings table, or reached Phases 2-4; stopped mid-recon.]

Field | Meaning
familiarity | 1–10 score inferred by the judge from the user’s actual messages. Low = vague prompts, high = precise technical instructions.
closure | 1-shot, multi-turn, or incomplete. Based on how many user turns occurred after the first skill invocation and whether the session reached a natural end.
Adherence | Checklist dynamically extracted by the LLM from the skill’s instructions, then checked against the actual tool calls in the transcript.
Verdict | PASS, FAIL, or UNKNOWN (when the judge call failed or the session had no usable data).

The checklist is not based on static expected-eval frontmatter; it is generated by feeding the full SKILL.md content + the observed tool call sequence to the judge model.

This means the more clearly and specifically a skill describes the steps the agent should take, the more precise and actionable the eval becomes.

View previous eval results stored on disk:

Terminal window
dora eval history
dora eval history --limit 50
dora eval history --skill improve

Results are saved (by default) under ~/.doraval/evals/.

How eval works:

  1. Locates a session transcript (via --session or by asking the active coding agent’s adapter; Claude Code is best supported today).
  2. Parses the JSONL to extract:
    • Which skills were invoked (via Skill tool calls)
    • The ordered tool call sequence
    • User messages (first 5 used for familiarity inference)
    • Session metadata
  3. For each skill, loads the local SKILL.md when possible.
  4. Sends a compact summary (tool calls + user messages) to an LLM.
    • Default: your configured coding agent CLI.
    • If eval.model and an API key are both set: calls the model directly over an OpenAI-compatible API, with no gateway needed. GLM models auto-default to Zhipu’s endpoint.
  5. Parses the structured verdict + checklist.
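
The parsing step can be sketched as follows. This is a hedged illustration only: the transcript schema used here (a `type` field, a `message.content` that is either a string or a list of blocks, `tool_use` blocks with `name` and `input`, and a `skill` key on Skill calls) is a simplified assumption, and `summarize_transcript` is a hypothetical name rather than part of doraval.

```python
import json

def summarize_transcript(lines, max_user_messages=5):
    """Extract skills, ordered tool calls, and early user messages from JSONL lines."""
    tool_calls = []      # ordered (tool name, input dict) pairs
    skills = []          # skill names in order of first invocation
    user_messages = []   # only the first few are kept
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        entry = json.loads(raw)
        content = (entry.get("message") or {}).get("content")
        if entry.get("type") == "user" and isinstance(content, str):
            if len(user_messages) < max_user_messages:
                user_messages.append(content)
        elif isinstance(content, list):
            for block in content:
                if block.get("type") != "tool_use":
                    continue
                tool_input = block.get("input", {})
                tool_calls.append((block["name"], tool_input))
                # Assumed convention: a Skill call names its skill under "skill".
                skill = tool_input.get("skill")
                if block["name"] == "Skill" and skill and skill not in skills:
                    skills.append(skill)
    return {"skills": skills, "tool_calls": tool_calls, "user_messages": user_messages}
```

Keeping the tool calls in transcript order matters because the judge checks ordering claims such as whether the skill was invoked before anything else.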

eval requires an agent to be configured:

Terminal window
dora init

During init you will be asked for the model to use for judging. You can also set it later:

Terminal window
dora config set eval.model claude-sonnet-4-6
dora config get eval.model

The judge can use your coding agent CLI, or call an LLM directly via API (no proxy or gateway server required).

For cheap dev evals:

  • doraval config set eval.model glm-5-turbo
  • Set ZAI_API_KEY (or ZHIPU_API_KEY) in your environment

We default glm* models to the general Z.AI endpoint (https://api.z.ai/api/paas/v4).

You don’t need to tell doraval which plan you’re on.

To use a different endpoint (e.g. Coding Plan, self-hosted, LiteLLM proxy, etc.), set one of these:

Terminal window
OPENAI_BASE_URL=https://api.z.ai/api/coding/paas/v4
# or
ZAI_BASE_URL=https://api.z.ai/api/coding/paas/v4

You can also set it permanently:

Terminal window
doraval config set eval.base_url https://api.z.ai/api/coding/paas/v4

glm-5-turbo is recommended for the judge because it is fast and cheap; see https://docs.z.ai for current model names. Direct API calls work with any OpenAI-compatible provider, and eval falls back to your agent CLI if no API key is available.
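
The endpoint and key lookup described above might resolve roughly like this. The precedence order (config value, then OPENAI_BASE_URL, then ZAI_BASE_URL, then the glm* default) is an assumption made for illustration, as are the function names; the source does not state which setting wins when several are present.

```python
import os

DEFAULT_GLM_BASE_URL = "https://api.z.ai/api/paas/v4"

def resolve_base_url(model, config, env=None):
    """Pick a base URL for direct judge calls, or None if nothing applies."""
    env = os.environ if env is None else env
    # Assumed precedence: explicit config first, then environment overrides.
    candidates = (
        config.get("eval.base_url"),
        env.get("OPENAI_BASE_URL"),
        env.get("ZAI_BASE_URL"),
    )
    for candidate in candidates:
        if candidate:
            return candidate
    if model.lower().startswith("glm"):
        return DEFAULT_GLM_BASE_URL
    return None

def resolve_api_key(env=None):
    """Return the Z.AI key from either supported variable, or None."""
    env = os.environ if env is None else env
    return env.get("ZAI_API_KEY") or env.get("ZHIPU_API_KEY")
```

When `resolve_api_key` returns nothing, the documented behavior is to fall back to the agent CLI rather than fail.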

Flag | Short | Description
--session <path> | | Path to one or more .jsonl session files (comma or space separated)
--skill <name> | | Filter evaluation to a specific skill
--format <type> | -f | table (default) or json
--verbose | -v | Show the “sending summary” notice and full details
--ci | | Exit with code 1 if any verdict is FAIL

These are equivalent:

Terminal window
dora eval --skill improve
dora judge improve
dora skill judge ./skills/improve/

judge is a thin wrapper that sets --skill and delegates to eval.

Before calling the judge, doraval prints:

· Sending session summary (tool calls + 5 user messages) to <model>.

Only tool names and truncated inputs (first ~100 chars) plus the first few user messages are sent. Full file contents and command output are not sent.
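
As a rough sketch of that truncation, each tool call can be reduced to its name plus the first 100 characters of its serialized input. The helper name `summarize_tool_calls`, the JSON serialization, and the trailing `...` marker are illustrative assumptions, not doraval's actual output format.

```python
import json

def summarize_tool_calls(tool_calls, limit=100):
    """Render (name, input) pairs as short lines with truncated inputs."""
    lines = []
    for name, tool_input in tool_calls:
        text = json.dumps(tool_input, sort_keys=True)
        if len(text) > limit:
            # Keep only the head of the input; large payloads never leave the machine.
            text = text[:limit] + "..."
        lines.append(f"{name}: {text}")
    return lines
```

Tool results are not part of the input pairs at all, which matches the statement that file contents and command output are not sent.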

Exit codes:

  • 0: success (or informational run)
  • 1: with --ci, at least one skill received FAIL
  • 2: fatal (no agent configured, no session found, etc.)

When to use eval:

  • After a real agent session that used one of your skills.
  • To measure whether a skill is actually effective in practice (as opposed to just being structurally valid).
  • As part of a feedback loop when authoring or refining high-leverage skills.
  • In CI for important skills when you want --ci + a recent golden session.
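
A CI gate built on the exit codes could look like the sketch below. It assumes the `dora` executable is on PATH and that `--ci`, `--session`, and `--skill` can be combined in one invocation; the helper names and any golden session path you pass in are hypothetical.

```python
import subprocess

EXIT_MEANINGS = {
    0: "success (or informational run)",
    1: "with --ci, at least one skill received FAIL",
    2: "fatal (no agent configured, no session found, etc.)",
}

def describe_exit(code):
    """Translate a documented exit code into a human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")

def run_ci_eval(session_path, skill):
    """Run eval in CI mode against one session and return the exit code."""
    cmd = ["dora", "eval", "--ci", "--session", session_path, "--skill", skill]
    return subprocess.run(cmd).returncode
```

A pipeline step would call `run_ci_eval` with the golden session, log `describe_exit(code)`, and fail the job on any nonzero code. Because the judge is non-deterministic, treating code 2 differently from code 1 (for example, retrying on 2) is a design choice worth considering.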

validate + drift are fast and deterministic. eval is slower and non-deterministic but gives the highest-signal feedback about real-world adherence.