eval-harness
OfficialFormal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
What this skill does
When applied, it prepends a system prompt before your request is sent — no extra calls and no change to how you are billed beyond the added tokens.
--- name: eval-harness description: Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles allowed-tools: Read, Write, Edit, Bash, Grep, Glob --- # Eval Harness Skill A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles. ## When to Activate - Setting up eval-driven development (EDD) for AI-assisted workflows - Defining pass/fail criteria for Claude Code task completion - Measuring agent reliability with pass@k metrics - Creating regression test suites for prompt or agent changes - Benchmarking agent performance across model versions ## Philosophy Eval-Driven Development treats evals as the "unit tests of AI development": - Define expected behavior BEFORE implementation - Run evals continuously during development - Track regressions with each change - Use pass@k metrics for reliability measurement ## Eval Types ### Capability Evals Test if Claude can do something it couldn't before: ```markdown [CAPABILITY EVAL: feature-name] Task: Description of what Claude should accomplish Success Criteria: - [ ] Criterion 1 - [ ] Criterion 2 - [ ] Criterion 3 Expected Output: Description of expected result ``` ### Regression Evals Ensure changes don't break existing functionality: ```markdown [REGRESSION EVAL: feature-name] Baseline: SHA or checkpoint name Tests: - existing-test-1: PASS/FAIL - existing-test-2: PASS/FAIL - existing-test-3: PASS/FAIL Result: X/Y passed (previously Y/Y) ``` ## Grader Types ### 1. Code-Based Grader Deterministic checks using code: ```bash # Check if file contains expected pattern grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL" # Check if tests pass npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL" # Check if build succeeds npm run build && echo "PASS" || echo "FAIL" ``` ### 2. Model-Based Grader Use Claude to evaluate open-ended outputs: ```markdown [MODEL GRADER PROMPT]
Use this skill
Add a "skill" field with the skill’s ID to your chat completion request. It is applied server-side before your prompt is sent — no extra calls.
{
"model": "gpt-4o-mini",
"skill": "imp-372fd643-733b-4a15-bb87-8883fa99e7af",
"messages": [{ "role": "user", "content": "…" }]
}Install the skill, enable it in your dashboard and (optionally) limit it to specific models. It then applies automatically to every matching request — with no "skill" field to send each time.
Set it up in your dashboardMore skills
Set up and use 1Password CLI for sign-in, desktop integration, and reading or injecting secrets.
Create, view, edit, delete, search, move, or export Apple Notes via the memo CLI on macOS.
List, add, edit, complete, or delete Apple Reminders and reminder lists via remindctl.
Create, search, and manage Bear notes via grizzly CLI.
Monitor blogs and RSS/Atom feeds for updates using the blogwatcher CLI.
BluOS CLI (blu) for discovery, playback, grouping, and volume.
Capture frames or clips from RTSP/ONVIF cameras.
Search, install, update, sync, or publish agent skills with the ClawHub CLI and registry.