agent-eval
Official编码代理(Claude Code、Aider、Codex等)在自定义任务上的直接比较,包含通过率、成本、时间和一致性指标
What this skill does
When applied, it prepends a system prompt before your request is sent — no extra calls and no change to how you are billed beyond the added tokens.
---
name: agent-eval
description: 编码代理(Claude Code、Aider、Codex等)在自定义任务上的直接比较,包含通过率、成本、时间和一致性指标
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob
---
# Agent Eval 技能
一个轻量级 CLI 工具,用于在可复现的任务上对编码代理进行头对头比较。每个“哪个编码代理最好?”的比较都基于感觉——本工具将其系统化。
## 何时使用
* 在你自己的代码库上比较编码代理(Claude Code、Aider、Codex 等)
* 在采用新工具或模型之前衡量代理性能
* 当代理更新其模型或工具时运行回归检查
* 为团队做出数据支持的代理选择决策
## 安装
```bash
# pinned to v0.1.0 — latest stable commit
pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
```
## 核心概念
### YAML 任务定义
以声明方式定义任务。每个任务指定要做什么、要修改哪些文件以及如何判断成功:
```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
- src/http_client.py
prompt: |
Add retry logic with exponential backoff to all HTTP requests.
Max 3 retries. Initial delay 1s, max delay 30s.
judge:
- type: pytest
command: pytest tests/test_http_client.py -v
- type: grep
pattern: "exponential_backoff|retry"
files: src/http_client.py
commit: "abc1234" # pin to specific commit for reproducibility
```
### Git 工作树隔离
每个代理运行都获得自己的 git 工作树——无需 Docker。这提供了可复现的隔离,使得代理之间不会相互干扰或损坏基础仓库。
### 收集的指标
| 指标 | 衡量内容 |
|--------|-----------------|
| 通过率 | 代理生成的代码是否通过了判断? |
| 成本 | 每个任务的 API 花费(如果可用) |
| 时间 | 完成所需的挂钟秒数 |
| 一致性 | 跨重复运行的通过率(例如,3/3 = 100%) |
## 工作流程
### 1. 定义任务
创建一个 `tasks/` 目录,其中包含 YAML 文件,每个任务一个文件:
```bash
mkdir tasks
# Write task definitions (see template above)
```
### 2. 运行代理
针对你的任务执行代理:
```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```
每次运行:
1. 从指定的提交创建一个新的 git 工作树
2. 将提示交给代理
3. 运行判断标准
4. 记录通过/失败、成本和时间
### 3. 比较结果
生成比较报告:
```bash
agent-eval report --format table
```
```
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent │ Pass Rate │ Cost │ Time │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code │ 3/3 │ $0.12 │ 45s │ 1Use this skill
Add a "skill" field with the skill’s ID to your chat completion request. It is applied server-side before your prompt is sent — no extra calls.
{
"model": "gpt-4o-mini",
"skill": "imp-e34b471d-9c24-4ba0-a8ec-95b63da36517",
"messages": [{ "role": "user", "content": "…" }]
}Install the skill, enable it in your dashboard and (optionally) limit it to specific models. It then applies automatically to every matching request — with no "skill" field to send each time.
Set it up in your dashboardMore skills
Set up and use 1Password CLI for sign-in, desktop integration, and reading or injecting secrets.
Create, view, edit, delete, search, move, or export Apple Notes via the memo CLI on macOS.
List, add, edit, complete, or delete Apple Reminders and reminder lists via remindctl.
Create, search, and manage Bear notes via grizzly CLI.
Monitor blogs and RSS/Atom feeds for updates using the blogwatcher CLI.
BluOS CLI (blu) for discovery, playback, grouping, and volume.
Capture frames or clips from RTSP/ONVIF cameras.
Search, install, update, sync, or publish agent skills with the ClawHub CLI and registry.