eval-harness

Official

by Api.Airforce · Prepends a system prompt · AI & Agent Building

A formal evaluation framework for Claude Code sessions that implements eval-driven development (EDD) principles

open-source, claude-code, ai-agent-building, affaan-m

What this skill does

When applied, it prepends a system prompt before your request is sent — no extra calls and no change to how you are billed beyond the added tokens.

---
name: eval-harness
description: A formal evaluation framework for Claude Code sessions that implements eval-driven development (EDD) principles
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob
---

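# Eval Harness Skill

A formal evaluation framework for Claude Code sessions that implements eval-driven development (EDD) principles.

## When to Activate

* Setting up eval-driven development (EDD) for AI-assisted workflows
* Defining pass/fail criteria for Claude Code task completion
* Measuring agent reliability with pass@k metrics
* Creating regression test suites for prompt or agent changes
* Benchmarking agent performance across model versions

## Philosophy

Eval-driven development treats evals as the "unit tests of AI development":

* Define expected behavior **before** implementation
* Run evals continuously during development
* Track regressions with every change
* Use pass@k metrics to measure reliability

## Eval Types

### Capability Evals

Test whether Claude can do something it could not do before: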

```markdown
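[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected Output: Description of the expected result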

```

### Regression Evals

Ensure that changes do not break existing functionality:

```markdown
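[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
  - existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)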

```
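The result line of a regression eval can be produced mechanically once each test outcome is recorded. A minimal sketch, assuming outcomes are stored as the strings `PASS` or `FAIL` and that the baseline passed every test; the function name `regression_summary` is hypothetical and not part of the skill:

```bash
# Hypothetical helper: summarize recorded regression outcomes.
# Usage: regression_summary PASS FAIL PASS ...
regression_summary() {
  local total="$#" passed=0 outcome
  for outcome in "$@"; do
    if [ "$outcome" = "PASS" ]; then
      passed=$((passed + 1))
    fi
  done
  # Assumption: the baseline run passed every test (Y/Y).
  echo "Result: ${passed}/${total} passed (previously ${total}/${total})"
  # A non-zero exit status signals a regression to the caller.
  [ "$passed" -eq "$total" ]
}
```

Returning a non-zero status on any failure lets the summary double as a gate in a script or CI step.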

## Grader Types

### 1. Code-Based Grader

Deterministic checks using code:

```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
```
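The `command && echo "PASS" || echo "FAIL"` pattern above can be wrapped in a small helper so that every check reports in the same format and also returns a usable exit status. This is only a sketch; `grade` is a hypothetical name, not something the skill defines:

```bash
# Hypothetical helper: run a check command and report a labeled PASS/FAIL.
# Usage: grade "label" command [args...]
grade() {
  local label="$1"
  shift
  # The check's own output is discarded; only its exit status is graded.
  if "$@" >/dev/null 2>&1; then
    echo "PASS: ${label}"
  else
    echo "FAIL: ${label}"
    return 1
  fi
}

# Example calls mirroring the checks above (same illustrative paths):
# grade "handleAuth exported" grep -q "export function handleAuth" src/auth.ts
# grade "build succeeds" npm run build
```

Unlike the one-line `&& ... ||` form, the helper cannot print `FAIL` merely because the `echo "PASS"` step itself failed.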

### 2. Model-Based Grader

Use Claude to evaluate open-ended outputs:

```markdown
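[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]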

```
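A model grader's reply still has to be turned into a pass/fail signal before it can feed a metric. A minimal sketch, assuming the reply contains a line of the form `Score: N` and that a score of 4 or higher counts as a pass; the threshold and the function name `score_passes` are assumptions, not part of the skill:

```bash
# Hypothetical helper: read a grader reply on stdin, pass if Score >= threshold.
# Usage: score_passes [threshold] < reply.txt
score_passes() {
  local threshold="${1:-4}" score
  # Take the first "Score: N" line and keep only the leading digit (1-5).
  score=$(sed -n 's/^Score:[[:space:]]*\([1-5]\).*/\1/p' | head -n 1)
  if [ -z "$score" ]; then
    # A reply without a parsable score is treated as a failure, not a pass.
    echo "FAIL (no score found)"
    return 1
  fi
  if [ "$score" -ge "$threshold" ]; then
    echo "PASS (score ${score})"
  else
    echo "FAIL (score ${score})"
    return 1
  fi
}
```

Treating an unparsable reply as a failure keeps a malformed grader response from silently inflating the pass rate.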

### 3. Human Grader

Flag for manual review:

```markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH

```

## Metrics

### pass@k

"At least one success in k attempts"

* pass@1: first-attempt success rate
* pass@3: success rate within 3 attempts
* Typical target: pass@3 > 90%
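For a single task, pass@k reduces to asking whether any of the k recorded attempts succeeded. A sketch under the assumption that each attempt is recorded as `1` (success) or `0` (failure); `pass_at_k` is a hypothetical helper name:

```bash
# Hypothetical helper: pass@k verdict for one task.
# Usage: pass_at_k 0 1 0   (one argument per attempt: 1 = success, 0 = failure)
pass_at_k() {
  local outcome
  for outcome in "$@"; do
    if [ "$outcome" -eq 1 ]; then
      # One success among the k attempts is enough.
      echo "PASS"
      return 0
    fi
  done
  echo "FAIL"
  return 1
}
```

The pass@k rate for a suite is then the fraction of tasks for which this verdict is `PASS`.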

### pass^k

"All k trials succeed"

* A higher bar for reliability
* pass^3: 3 consecutive successes
* Use for critical paths
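pass^k is the stricter counterpart: a single failed trial fails the task. A sketch assuming each trial is recorded as `1` (success) or `0` (failure); the name `pass_hat_k` is hypothetical, and treating an empty trial list as a failure is a design choice made here, not something the skill specifies:

```bash
# Hypothetical helper: pass^k verdict for one task.
# Usage: pass_hat_k 1 1 1   (one argument per trial: 1 = success, 0 = failure)
pass_hat_k() {
  local outcome
  # Assumption: with no recorded trials there is nothing to certify, so fail.
  if [ "$#" -eq 0 ]; then
    echo "FAIL"
    return 1
  fi
  for outcome in "$@"; do
    if [ "$outcome" -ne 1 ]; then
      # Any single failure breaks the all-k-succeed requirement.
      echo "FAIL"
      return 1
    fi
  done
  echo "PASS"
}
```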

## Eval Workflow

### 1. Define (Before Coding)

```markdown
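## EVAL DEFINITION: feature-xyz

### Capability Evals
1. Can create a new user account
2. Can validate email format
3. Can hash passwords securely

### Regression Evals
1. Existing login still works
2. Session management is unchanged
3. Logout flow is intact

### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals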

```
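The two success metrics can be checked as gates once every eval has a verdict. A sketch assuming each capability eval has already been reduced to a pass@3 verdict and each regression eval to a pass^3 verdict, both recorded as `PASS` or `FAIL`; all three function names are hypothetical:

```bash
# Hypothetical helper: count how many recorded verdicts are "PASS".
count_passes() {
  local passed=0 outcome
  for outcome in "$@"; do
    if [ "$outcome" = "PASS" ]; then
      passed=$((passed + 1))
    fi
  done
  echo "$passed"
}

# Capability gate: strictly more than min_percent of evals must pass.
# Usage: capability_gate 90 PASS PASS FAIL ...
capability_gate() {
  local min_percent="$1"
  shift
  local total="$#" passed
  passed=$(count_passes "$@")
  # Cross-multiplied integer comparison avoids rounding:
  # passed/total > min_percent/100
  [ "$total" -gt 0 ] && [ $((passed * 100)) -gt $((min_percent * total)) ]
}

# Regression gate: every eval must pass (100%).
# Usage: regression_gate PASS PASS PASS ...
regression_gate() {
  local total="$#" passed
  passed=$(count_passes "$@")
  [ "$total" -gt 0 ] && [ "$passed" -eq "$total" ]
}
```

Note that with only three capability evals, a target of more than 90% is met only when all three pass.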

### 2. Implement

Write code to pass the defined evals.

### 3. Evaluate

```bash
# Run capability evals
# (run each capability eval and record PASS/FAIL)

# Run regression evals
npm test -- --testPathPattern="existing"

# Generate report
```
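Measuring pass@k or pass^k requires repeating the same eval several times and recording each outcome. A minimal sketch of such a loop; `run_trials` is a hypothetical helper, and the `1`/`0` output encoding is an assumption chosen for illustration:

```bash
# Hypothetical helper: run a command k times and print one outcome per trial.
# Usage: run_trials 3 npm test -- --testPathPattern="existing"
run_trials() {
  local k="$1" i=1 outcomes=""
  shift
  while [ "$i" -le "$k" ]; do
    # Only the exit status of each trial is recorded; its output is discarded.
    if "$@" >/dev/null 2>&1; then
      outcomes="${outcomes:+$outcomes }1"
    else
      outcomes="${outcomes:+$outcomes }0"
    fi
    i=$((i + 1))
  done
  # Prints a space-separated list with one entry per trial.
  echo "$outcomes"
}
```

The printed list has one entry per trial (1 = success, 0 = failure), so it can be inspected directly or passed on to whatever computes the metrics.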

### 4. Report

```markdown
EVAL REPORT: feature-xyz
========================

Capability Evals:
  Create user:    PASS (pass@1)
  Validate email…
```

Use this skill

Per request

Add a "skill" field with the skill’s ID to your chat completion request. It is applied server-side before your prompt is sent — no extra calls.

{
  "model": "gpt-4o-mini",
  "skill": "imp-5e923559-e8d9-460b-ae89-0a70994ba708",
  "messages": [{ "role": "user", "content": "…" }]
}

Always on — no field to send

Install the skill, enable it in your dashboard and (optionally) limit it to specific models. It then applies automatically to every matching request — with no "skill" field to send each time.

