eval
doraval eval reads an actual session transcript from your coding agent (Claude Code, etc.), identifies which skills were used, and runs an LLM judge to produce a structured PASS / FAIL verdict per skill.
It answers the question that validate and drift cannot: Did the agent actually follow the skill’s instructions in a real session?
    doraval eval [options]
    doraval eval --session <path>
    doraval eval --skill <name-or-path>

The most common way to use it is with no arguments:

    dora eval

doraval will find recent sessions in the current directory that used skills and let you pick interactively.
Interactive session selection
When you run dora eval without --session, it shows recent sessions that contain skill invocations:

    Recent sessions for this directory:
      1. 2026-06-20 "Summarize repo using CLI developer" (1 skill) e5d77e7c-....jsonl
      2. 2026-06-20 "Explore improvement options" (1 skill) 41be1f09-....jsonl

    Select session(s) (e.g. 1,3 or 2-4 or all or latest):

You can select:
- A number (2)
- A range (1-3)
- all
- latest
- Or paste a direct path to a .jsonl file
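The selection syntax above can be sketched as a small parser. This is an illustration only, not doraval's actual implementation: the function name `parse_selection`, its return convention (a list of 1-based indices, or the raw string for a path), and the choice to treat latest as entry 1 are all assumptions.

```python
from typing import List, Union


def parse_selection(text: str, count: int) -> Union[List[int], str]:
    """Turn a prompt answer into 1-based session indices, or a path string.

    Hypothetical helper: `count` is the number of sessions listed.
    """
    text = text.strip()
    if text.endswith(".jsonl"):
        return text  # treated as a direct path to a transcript
    if text == "all":
        return list(range(1, count + 1))
    if text == "latest":
        return [1]  # ASSUMPTION: the first listed session is the most recent

    picked: List[int] = []
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            low_str, high_str = part.split("-", 1)
            low, high = int(low_str), int(high_str)
        else:
            low = high = int(part)
        if not (1 <= low <= high <= count):
            raise ValueError(f"selection out of range: {part!r}")
        for index in range(low, high + 1):
            if index not in picked:  # keep order, drop duplicates
                picked.append(index)
    return picked
```

Invalid input such as an empty answer or an index beyond the list raises `ValueError`, which a real prompt loop would presumably catch and ask again.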
Targeting a specific skill
    dora eval --skill improve
    dora judge ./skills/improve/        # convenient alias
    dora skill judge ./skills/improve/

Example output
    doraval eval: Session skill adherence

    Session: /Users/.../41be1f09-....jsonl
    · Sending session summary (tool calls + 5 user messages) to claude-sonnet-4-6. Use --verbose to inspect.

    Evaluating: improve → claude -p "{{prompt}}" --output-format json...

    [FAIL] improve
      agent: claude-code
      model: claude-sonnet-4-6
      familiarity: 2/10 (User message is extremely brief and vague ('i want to improve; give me few options'), relying entirely on the agent to determine what to do.)
      closure: incomplete (1 turns)
      session: 41be1f09 "Explore improvement options"

      Adherence:
        ✓ Invoke the improve skill before responding                            Skill tool called first
        ✓ Phase 1: Read README, AGENTS.md, config files, directory structure    Read agents.md, package.json, hooks.json, migration doc, directory listings
        ✗ Phase 1: Run git log for churn signal                                 No git log command observed
        ✗ Phase 1: Read references/audit-playbook.md                            Audit playbook was never read
        ✗ Phase 2: Fan out parallel audit subagents across categories           No subagents launched; audit phase not reached
        ✗ Phase 3: Vet findings and present findings table to user              No findings table presented
        ✗ Phase 3: Ask user which findings to turn into plans                   Session ended before this step
        ✗ Phase 4: Write plans only after user selection                        Phase 4 never reached
        ✓ Never modify source code or run mutating commands                     Only read-only operations observed

      Result: 3/9 [FAIL: Agent completed Phase 1 recon but never launched parallel audit subagents, read the audit playbook, presented a findings table, or reached Phases 2-4; stopped mid-recon.]

Understanding the output
| Field | Meaning |
|---|---|
| familiarity | 1–10 score inferred by the judge from the user’s actual messages. Low = vague prompts, high = precise technical instructions. |
| closure | 1-shot, multi-turn, or incomplete. Based on how many user turns occurred after the first skill invocation and whether the session reached a natural end. |
| Adherence | Checklist dynamically extracted by the LLM from the skill’s instructions, then checked against the actual tool calls in the transcript. |
| Verdict | PASS, FAIL, or UNKNOWN (when the judge call failed or the session had no usable data). |
The checklist is not based on static expected-eval frontmatter. It is generated by feeding the full SKILL.md content and the observed tool call sequence to the judge model.
This means the more clearly and specifically a skill describes the steps the agent should take, the more precise and actionable the eval becomes.
doraval eval history
View previous eval results stored on disk:

    dora eval history
    dora eval history --limit 50
    dora eval history --skill improve

Results are saved (by default) under ~/.doraval/evals/.
How doraval eval works
- Locates a session transcript (via --session or by asking the active coding agent’s adapter; Claude Code is best supported today).
- Parses the JSONL to extract:
  - Which skills were invoked (via Skill tool calls)
  - The ordered tool call sequence
  - User messages (first 5 used for familiarity inference)
  - Session metadata
- For each skill, loads the local SKILL.md when possible.
- Sends a compact summary (tool calls + user messages) to an LLM.
  - Default: your configured coding agent CLI.
  - If eval.model + API key is present: calls directly (OpenAI-compatible). No gateway needed. GLM models auto-default to Zhipu’s endpoint.
- Parses the structured verdict + checklist.
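The parsing step can be pictured with a short sketch. It is not doraval's code, and the transcript layout is an assumption: each JSONL line is taken to be an object with a `type` and a `message` whose `content` is either a string (a user message) or a list of blocks, with tool calls appearing as `tool_use` blocks. The `skill` key inside the Skill tool's input is likewise assumed, as is the function name `summarize_transcript`.

```python
import json
from typing import Any, Dict, Iterable, List


def summarize_transcript(lines: Iterable[str]) -> Dict[str, Any]:
    """Collect invoked skills, ordered tool names, and the first 5 user messages."""
    skills: List[str] = []
    tool_calls: List[str] = []
    user_messages: List[str] = []

    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip partial or corrupt lines instead of failing
        if not isinstance(entry, dict):
            continue
        message = entry.get("message")
        if not isinstance(message, dict):
            continue
        content = message.get("content")

        # ASSUMPTION: plain-string content on a "user" entry is a typed message.
        if entry.get("type") == "user" and isinstance(content, str):
            user_messages.append(content)
        if not isinstance(content, list):
            continue

        for block in content:
            if not isinstance(block, dict) or block.get("type") != "tool_use":
                continue
            name = block.get("name", "")
            tool_calls.append(name)
            if name == "Skill":
                # ASSUMPTION: the skill name lives under input["skill"].
                skill = (block.get("input") or {}).get("skill")
                if skill and skill not in skills:
                    skills.append(skill)

    return {
        "skills": skills,
        "tool_calls": tool_calls,
        "user_messages": user_messages[:5],
    }
```

Skipping unreadable lines rather than raising keeps a single damaged record from hiding an otherwise usable session.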
Configuration
eval requires an agent to be configured:

    dora init

During init you will be asked for the model to use for judging. You can also set it later:

    dora config set eval.model claude-sonnet-4-6
    dora config get eval.model

The judge can use your coding agent CLI, or call an LLM directly via API (no proxy or gateway server required).
For cheap dev evals:
- doraval config set eval.model glm-5-turbo
- Set ZAI_API_KEY (or ZHIPU_API_KEY) in your environment
We default glm* models to the general Z.AI endpoint (https://api.z.ai/api/paas/v4).
You don’t need to tell doraval which plan you’re on.
To use a different endpoint (e.g. Coding Plan, self-hosted, LiteLLM proxy, etc.), set one of these:
    OPENAI_BASE_URL=https://api.z.ai/api/coding/paas/v4
    # or
    ZAI_BASE_URL=https://api.z.ai/api/coding/paas/v4

You can also set it permanently:

    doraval config set eval.base_url https://api.z.ai/api/coding/paas/v4

glm-5-turbo is recommended for the judge (fast + cheap). See https://docs.z.ai for current model names. The judge works with any OpenAI-compatible provider and falls back to your agent CLI if no API key is available.
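One way to read the rules in this section is as a small resolution function. The sketch below is an interpretation, not doraval's implementation. The function name `resolve_judge` and the returned dictionary shape are hypothetical. The precedence order (environment variables before eval.base_url) is an assumption, since this page does not state it. Only the API key variables named above are checked.

```python
from typing import Dict, Mapping, Optional

ZAI_GENERAL_ENDPOINT = "https://api.z.ai/api/paas/v4"


def resolve_judge(
    model: Optional[str],
    env: Mapping[str, str],
    config_base_url: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    """Decide between a direct API call and the agent CLI fallback."""
    api_key = env.get("ZAI_API_KEY") or env.get("ZHIPU_API_KEY")
    if not model or not api_key:
        # No model or no key: fall back to the configured coding agent CLI.
        return {"mode": "agent-cli"}

    # ASSUMPTION: env overrides win over the persisted eval.base_url value.
    base_url = (
        env.get("OPENAI_BASE_URL")
        or env.get("ZAI_BASE_URL")
        or config_base_url
    )
    if base_url is None and model.startswith("glm"):
        base_url = ZAI_GENERAL_ENDPOINT  # default for glm* models

    return {"mode": "api", "model": model, "base_url": base_url}
```

For example, `resolve_judge("glm-5-turbo", {"ZAI_API_KEY": "..."})` selects the general Z.AI endpoint, while the same call with an empty environment falls back to the agent CLI.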
Options
| Flag | Short | Description |
|---|---|---|
| --session <path> | | Path to one or more .jsonl session files (comma or space separated) |
| --skill <name> | | Filter evaluation to a specific skill |
| --format <type> | -f | table (default) or json |
| --verbose | -v | Show the “sending summary” notice and full details |
| --ci | | Exit with code 1 if any verdict is FAIL |
Aliases
These are equivalent:

    dora eval --skill improve
    dora judge improve
    dora skill judge ./skills/improve/

judge is a thin wrapper that sets --skill and delegates to eval.
Privacy note
Before calling the judge, doraval prints:

    · Sending session summary (tool calls + 5 user messages) to <model>.

Only tool names and truncated inputs (first ~100 chars) plus the first few user messages are sent. Full file contents and command output are not sent.
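The shape of that summary can be sketched as follows. This is a hedged illustration of the stated limits, not the real serializer: the function name `build_judge_summary`, the input and output dictionary layout, and the use of JSON to render tool inputs are assumptions. The 100-character and 5-message limits come from the note above.

```python
import json
from typing import Any, Dict, List

MAX_INPUT_CHARS = 100   # "first ~100 chars" of each tool input
MAX_USER_MESSAGES = 5   # "5 user messages" in the notice


def build_judge_summary(
    tool_calls: List[Dict[str, Any]],
    user_messages: List[str],
) -> Dict[str, Any]:
    """Keep only tool names, truncated inputs, and the first user messages."""
    compact = []
    for call in tool_calls:
        rendered = json.dumps(call.get("input", {}), sort_keys=True)
        compact.append({
            "tool": call.get("name", ""),
            "input": rendered[:MAX_INPUT_CHARS],
        })
        # Any other key on `call` (for example a tool result holding file
        # contents or command output) is never read, so it is never sent.
    return {
        "tool_calls": compact,
        "user_messages": user_messages[:MAX_USER_MESSAGES],
    }
```

Because only the name and input of each call are read, tool results never enter the summary, and a long input is cut at the character limit.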
Exit codes
- 0: success (or informational run)
- 1: with --ci, at least one skill received FAIL
- 2: fatal (no agent configured, no session found, etc.)
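The mapping above is small enough to express as a function. This is a minimal sketch of the documented behaviour, not doraval's source: the function name `exit_code` and its parameters are hypothetical. It takes the table literally, so an UNKNOWN verdict does not fail a --ci run. That is an inference from the wording, not a confirmed behaviour.

```python
from typing import Iterable


def exit_code(verdicts: Iterable[str], ci: bool = False, fatal: bool = False) -> int:
    """Map a run's outcome to the documented process exit code."""
    if fatal:
        return 2  # e.g. no agent configured, no session found
    if ci and any(verdict == "FAIL" for verdict in verdicts):
        return 1  # only --ci turns a FAIL verdict into a failing exit
    return 0      # success, or an informational run without --ci
```

Without `ci=True` a FAIL verdict still yields 0, which matches the "informational run" wording for exit code 0.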
When to use eval
- After a real agent session that used one of your skills.
- To measure whether a skill is actually effective in practice (as opposed to just being structurally valid).
- As part of a feedback loop when authoring or refining high-leverage skills.
- In CI for important skills when you want --ci + a recent golden session.
validate + drift are fast and deterministic. eval is slower and non-deterministic but gives the highest-signal feedback about real-world adherence.