eval-harness

Official

by Api.Airforce | Prepends a system prompt | AI & Agent Building

A formal evaluation framework for Claude Code sessions that implements the principles of eval-driven development (EDD).

open-source, claude-code, ai-agent-building, affaan-m

What this skill does

When applied, it prepends a system prompt before your request is sent — no extra calls and no change to how you are billed beyond the added tokens.

---
name: eval-harness
description: A formal evaluation framework for Claude Code sessions that implements the principles of eval-driven development (EDD)
tools: Read, Write, Edit, Bash, Grep, Glob
---

# Eval Harness Skill

A formal evaluation framework for Claude Code sessions that implements the principles of eval-driven development (EDD).

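## Philosophy

Eval-driven development treats evals as the "unit tests of AI development":
- Define expected behavior before implementing
- Run evals continuously during development
- Track regressions with every change
- Use pass@k metrics to measure reliability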

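## Eval Types

### Capability Evals
Test whether Claude can now do something it could not do before:
```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected output: Description of the expected result
```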
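A checklist like the one in this template can be tallied mechanically once criteria are ticked off. The following is a minimal sketch, assuming a met criterion is marked `- [x]` (the skill itself does not specify how criteria are ticked); `criteria_status` is a hypothetical helper name.

```bash
# Hypothetical helper: count ticked vs. total checklist items read from stdin.
# Assumption: a met criterion is written "- [x]" and an unmet one "- [ ]".
criteria_status() {
  awk '
    /^[ \t]*- \[[xX]\]/ { met++; total++ }
    /^[ \t]*- \[ \]/    { total++ }
    END { printf "%d/%d criteria met\n", met, total }
  '
}

# Example eval record with two of three criteria ticked (made-up data)
printf '%s\n' \
  '[CAPABILITY EVAL: feature-name]' \
  'Success criteria:' \
  '  - [x] Criterion 1' \
  '  - [x] Criterion 2' \
  '  - [ ] Criterion 3' | criteria_status
```

With the sample input this prints `2/3 criteria met`.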

### Regression Evals
Confirm that changes do not break existing functionality:
```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
  - existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```
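The X/Y summary line can be derived from the per-test lines rather than counted by hand. This is a sketch only: `tally_results` is a hypothetical helper, and it assumes each recorded line ends in exactly `PASS` or `FAIL` after a colon.

```bash
# Hypothetical helper: read "name: PASS" / "name: FAIL" lines on stdin and
# print a summary in the template's X/Y form. Other lines are ignored.
tally_results() {
  awk '
    /: *PASS *$/ { passed++; total++ }
    /: *FAIL *$/ { total++ }
    END { printf "Result: %d/%d passed\n", passed, total }
  '
}

# Example input mirroring the template's test list (made-up outcomes)
printf '%s\n' \
  'existing-test-1: PASS' \
  'existing-test-2: FAIL' \
  'existing-test-3: PASS' | tally_results
```

With the sample input this prints `Result: 2/3 passed`, which can then be compared with the baseline count.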

## Grader Types

### 1. Code-Based Graders
Deterministic checks using code:
```bash
# Check whether a file contains the expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check whether the tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check whether the build succeeds
npm run build && echo "PASS" || echo "FAIL"
```
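When several such checks run together, it can help to give each one a name so the results are easy to record. The wrapper below is a sketch under that assumption; `run_check` is a hypothetical name, and `true` / `false` stand in for the real `grep` and `npm` commands so the example runs anywhere.

```bash
# Hypothetical wrapper: run any command, discard its output, and print a
# "name: PASS" or "name: FAIL" line based on the command's exit status.
run_check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "${name}: PASS"
  else
    echo "${name}: FAIL"
  fi
}

# Stand-in commands; in a real project these would be the grep / npm
# invocations from the block above.
run_check "always-true" true
run_check "always-false" false
```

Using an explicit `if` instead of `cmd && echo PASS || echo FAIL` keeps the verdict tied only to the checked command's exit status.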

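### 2. Model-Based Graders
Use Claude to evaluate free-form output:
```markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well structured?
3. Are edge cases handled?
4. Is the error handling appropriate?

Score: 1-5 (1 = poor, 5 = excellent)
Reasoning: [explanation]
```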
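To turn a model grader's reply into a PASS/FAIL verdict, the score has to be parsed out of free text. The sketch below assumes the reply contains a line of the exact form `Score: N` with N from 1 to 5, and that a score of 4 or higher counts as a pass; both the `extract_score` helper and the threshold are illustrative assumptions, not part of the skill.

```bash
# Hypothetical helper: print the first "Score: N" value (N in 1-5) found on
# stdin. Prints nothing and returns 1 if no well-formed score line exists.
extract_score() {
  local score
  score=$(awk '/^[ \t]*Score:[ \t]*[1-5][ \t]*$/ { gsub(/[^1-5]/, ""); print; exit }')
  [ -n "$score" ] || return 1
  echo "$score"
}

# Made-up grader reply, for illustration only
reply='Score: 4
Reasoning: Solves the problem, but one edge case is unhandled.'

# Assumed threshold: 4 or higher passes
if score=$(printf '%s\n' "$reply" | extract_score) && [ "$score" -ge 4 ]; then
  echo "PASS (score ${score})"
else
  echo "FAIL"
fi
```

Treating a missing or malformed score as a failure keeps an unparseable reply from silently passing.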

### 3. Human Graders
Flag for manual review:
```markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk level: LOW/MEDIUM/HIGH
```

## Metrics

### pass@k
"At least one success in k attempts"
- pass@1: success rate on the first attempt
- pass@3: success within 3 attempts
- Typical target: pass@3 > 90%
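The definition above can be computed directly from an attempt log. This is a minimal empirical sketch, not a statistical estimator: it assumes one eval per line with attempt outcomes listed in order, and `pass_at_k` is a hypothetical helper name.

```bash
# Hypothetical helper: pass_at_k K
# Reads one eval per line on stdin, each line listing attempt outcomes in
# order (e.g. "FAIL PASS PASS"), and prints the percentage of evals with at
# least one PASS among the first K attempts.
pass_at_k() {
  awk -v k="$1" '
    NF > 0 {
      total++
      limit = (NF < k) ? NF : k
      for (i = 1; i <= limit; i++) {
        if ($i == "PASS") { hits++; break }
      }
    }
    END {
      if (total == 0) { print "no evals"; exit }
      printf "pass@%d: %.0f%%\n", k, 100 * hits / total
    }
  '
}

# Made-up attempt log: three evals, three attempts each
attempts='PASS PASS PASS
FAIL FAIL PASS
FAIL FAIL FAIL'

printf '%s\n' "$attempts" | pass_at_k 1
printf '%s\n' "$attempts" | pass_at_k 3
```

For this sample log the two calls print `pass@1: 33%` and `pass@3: 67%`: only the first eval succeeds on its first attempt, while two of the three succeed within three.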

### pass^k
"All k attempts succeed"
- A higher bar for reliability
- pass^3: 3 consecutive successes
- Use for critical paths
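The stricter metric can be sketched the same way. The example assumes one eval per line with attempt outcomes in order, and it counts an eval with fewer than k recorded attempts as not meeting the bar (a design choice made here, not something the skill specifies); `pass_pow_k` is a hypothetical helper name.

```bash
# Hypothetical helper: pass_pow_k K
# Reads one eval per line on stdin and prints the percentage of evals whose
# first K attempts are all PASS. Evals with fewer than K attempts do not count
# as passing (assumption).
pass_pow_k() {
  awk -v k="$1" '
    NF > 0 {
      total++
      ok = (NF >= k)
      for (i = 1; ok && i <= k; i++) {
        if ($i != "PASS") ok = 0
      }
      if (ok) hits++
    }
    END {
      if (total == 0) { print "no evals"; exit }
      printf "pass^%d: %.0f%%\n", k, 100 * hits / total
    }
  '
}

# Made-up attempt log: four evals, one with a flaky second attempt
attempts='PASS PASS PASS
PASS FAIL PASS
PASS PASS PASS
PASS PASS PASS'

printf '%s\n' "$attempts" | pass_pow_k 3
```

This sample prints `pass^3: 75%`, even though every eval in it succeeds at least once within three attempts, which is why the metric suits critical paths.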

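## Eval Workflow

### 1. Define (before coding)
```markdown
## Eval definition: feature-xyz

### Capability evals
1. Can create a new user account
2. Can validate email format
3. Can hash passwords securely

### Regression evals
1. Existing login still works
2. Session management is unchanged
3. Logout flow is preserved

### Success metrics
- pass@3 > 90% on capability evals
- pass^3 = 100% on regression evals
```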
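The two success metrics in this definition can be enforced as a simple gate once the percentages have been measured. The sketch below assumes both values arrive as whole-number percentages computed elsewhere; `check_gates` and its output wording are hypothetical.

```bash
# Hypothetical gate: check_gates CAPABILITY_PASS_AT_3 REGRESSION_PASS_POW_3
# Both arguments are whole-number percentages (assumption). Returns 0 only
# if capability pass@3 > 90 and regression pass^3 = 100.
check_gates() {
  local capability="$1" regression="$2" status=0
  if [ "$capability" -gt 90 ]; then
    echo "capability pass@3 ${capability}%: OK"
  else
    echo "capability pass@3 ${capability}%: BELOW TARGET"
    status=1
  fi
  if [ "$regression" -eq 100 ]; then
    echo "regression pass^3 ${regression}%: OK"
  else
    echo "regression pass^3 ${regression}%: BELOW TARGET"
    status=1
  fi
  return "$status"
}

# Made-up measurements for illustration
check_gates 93 100 && echo "all gates met"
```

The function's exit status reflects the gates, so a failing gate can stop a script or CI step.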

### 2. Implement
Write code that passes the defined evals.

### 3. Evaluate
```bash
# Run capability evals
# (run each capability eval and record PASS/FAIL)

# Run regression evals
npm test -- --testPathPattern="existing"

# Generate the report
```

### 4. Report

Use this skill

Per request

Add a "skill" field with the skill’s ID to your chat completion request. It is applied server-side before your prompt is sent — no extra calls.

{
  "model": "gpt-4o-mini",
  "skill": "imp-694228d8-f25c-40f0-9472-a0db1b5aec4c",
  "messages": [{ "role": "user", "content": "…" }]
}

Always on — no field to send

Install the skill, enable it in your dashboard and (optionally) limit it to specific models. It then applies automatically to every matching request — with no "skill" field to send each time.

