eval-harness

Official · by Api.Airforce · Prepends a system prompt · AI & Agent Building

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

Tags: open-source, claude-code, ai-agent-building, affaan-m

What this skill does

When applied, it prepends a system prompt before your request is sent — no extra calls and no change to how you are billed beyond the added tokens.

---
name: eval-harness
description: Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob
---

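# Eval Harness Skill

A formal evaluation framework for Claude Code sessions that implements eval-driven development (EDD) principles.

## When to Activate

- Setting up eval-driven development (EDD) for AI-assisted workflows
- Defining pass/fail criteria for Claude Code task completion
- Measuring agent reliability with pass@k metrics
- Creating regression test suites for prompt or agent changes
- Benchmarking agent performance across model versions

## Philosophy

Eval-driven development treats evals as the "unit tests of AI development":
- Define expected behavior before implementation
- Run evals continuously during development
- Track regressions with each change
- Use pass@k metrics to measure reliability

## Eval Types

### Capability Evals
Test whether Claude can do something it could not do before: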
```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected Output: Description of expected result
```
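
One way to turn the checklist above into a pass/fail verdict is to count ticked criteria. The sketch below assumes criteria are marked done with `[x]` (the usual Markdown task-list convention, which this skill does not itself specify); `eval_status` is a hypothetical helper name.

```bash
# Hypothetical helper: count checked vs. total criteria in an eval file.
# Assumes a completed criterion is written as "- [x]".
eval_status() {
  # usage: eval_status EVAL_FILE
  # Prints "checked/total"; returns 0 only when every criterion is checked.
  local total checked
  total="$(grep -c '^ *- \[[ x]] ' "$1")"
  checked="$(grep -c '^ *- \[x] ' "$1")"
  echo "$checked/$total"
  [ "$total" -gt 0 ] && [ "$checked" -eq "$total" ]
}

# Demo with a throwaway eval file.
f="$(mktemp)"
cat > "$f" <<'EOF'
[CAPABILITY EVAL: feature-name]
Success Criteria:
  - [x] Criterion 1
  - [x] Criterion 2
  - [ ] Criterion 3
EOF
if summary="$(eval_status "$f")"; then verdict=PASS; else verdict=FAIL; fi
echo "$verdict ($summary criteria met)"
rm -f "$f"
```

With one criterion still unchecked, the demo reports `FAIL (2/3 criteria met)`.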

### Regression Evals
Verify that changes do not break existing functionality:
```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
  - existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```
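
A regression report like the one above could be derived by comparing two result files. This is a minimal sketch that assumes each file holds one `name: PASS` or `name: FAIL` line per test; the file layout and the `find_regressions` helper are illustrative, not part of the skill.

```bash
# Hypothetical helper: list tests that passed in the baseline but not now.
find_regressions() {
  # usage: find_regressions BASELINE_FILE CURRENT_FILE
  awk -F': *' '
    NR == FNR { base[$1] = $2; next }                # first file: store baseline status
    base[$1] == "PASS" && $2 != "PASS" { print $1 }  # second file: report regressions
  ' "$1" "$2"
}

# Demo with throwaway result files.
dir="$(mktemp -d)"
printf '%s\n' 'existing-test-1: PASS' 'existing-test-2: PASS' 'existing-test-3: PASS' > "$dir/baseline.txt"
printf '%s\n' 'existing-test-1: PASS' 'existing-test-2: FAIL' 'existing-test-3: PASS' > "$dir/current.txt"
regressions="$(find_regressions "$dir/baseline.txt" "$dir/current.txt")"
echo "Regressed: ${regressions:-none}"
rm -rf "$dir"
```

Here the demo prints `Regressed: existing-test-2`, which corresponds to a `2/3 passed (previously 3/3)` result line.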

## Grader Types

### 1. Code-Based Grader
Deterministic checks using code:
```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
```
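
When several such checks belong to one eval, a small wrapper can tally them into a single result line. The sketch below is one possible shape; `run_check` is a hypothetical helper, and the demo checks run against a temporary file rather than a real project.

```bash
# Hypothetical wrapper: run each check command and tally PASS/FAIL.
passed=0
total=0

run_check() {
  # usage: run_check NAME COMMAND [ARGS...]
  local name="$1"
  shift
  total=$((total + 1))
  if "$@" >/dev/null 2>&1; then
    passed=$((passed + 1))
    echo "PASS  $name"
  else
    echo "FAIL  $name"
  fi
}

# Demo checks against a throwaway file; swap in real grep/npm commands.
sample="$(mktemp)"
echo 'export function handleAuth() {}' > "$sample"
run_check "pattern-present" grep -q "export function handleAuth" "$sample"
run_check "pattern-missing" grep -q "export function logout" "$sample"
run_check "file-exists" test -f "$sample"
rm -f "$sample"

echo "Result: $passed/$total passed"
```

The demo ends with `Result: 2/3 passed`, because the second pattern is deliberately absent from the sample file.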

### 2. Model-Based Grader
Use Claude to evaluate open-ended outputs:
```markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
```
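
To feed a model grader's reply into an automated pipeline, the numeric score has to be extracted and compared to a threshold. The sketch below assumes the reply contains a line of the form `Score: N`, and that 4 or higher counts as a pass; both the `extract_score` helper and the threshold are illustrative assumptions.

```bash
# Hypothetical helper: pull the numeric score out of a grader reply.
extract_score() {
  # Reads the reply on stdin; prints the first "Score: N" value (1-5).
  sed -n 's/^Score:[[:space:]]*\([1-5]\).*/\1/p' | head -n 1
}

# Sample reply, invented for the demo.
reply='Score: 4
Reasoning: Solves the problem; one edge case is unhandled.'
score="$(printf '%s\n' "$reply" | extract_score)"

# Assumed threshold: treat 4 or higher as a pass.
if [ "${score:-0}" -ge 4 ]; then
  echo "PASS ($score/5)"
else
  echo "FAIL (${score:-none}/5)"
fi
```

A reply with no parseable score falls through to `FAIL`, which is the safer default for an automated gate.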

### 3. Human Grader
Flag for manual review:
```markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
```

## Metrics

### pass@k
"At least one success in k attempts"
- pass@1: first-attempt success rate
- pass@3: success within 3 attempts
- Typical target: pass@3 > 90%
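
As a rough illustration, pass@k can be estimated from n recorded attempts of which c succeeded. The sketch below uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k); the skill itself does not prescribe an estimator, so treat this choice and the `pass_at_k` helper name as assumptions.

```bash
# Hypothetical helper: estimate pass@k from N attempts with C successes.
pass_at_k() {
  # usage: pass_at_k N C K
  awk -v n="$1" -v c="$2" -v k="$3" 'BEGIN {
    # Fewer than k failures: any k attempts must include a success.
    if (n - c < k) { printf "%.4f\n", 1; exit }
    # C(n-c, k) / C(n, k), computed as a running product to avoid large factorials.
    prod = 1
    for (i = n - c + 1; i <= n; i++) prod *= 1 - k / i
    printf "%.4f\n", 1 - prod
  }'
}

pass_at_k 10 3 1   # prints 0.3000 (same as c/n)
pass_at_k 10 3 3   # prints 0.7083
```

With 3 successes in 10 attempts, pass@1 is 30% but pass@3 is about 71%, so this task would still fall short of a 90% pass@3 target.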

### pass^k
"All k trials succeed"
- For reliability
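
For illustration, pass^k can be computed two ways: from an assumed per-trial success rate, or directly from recorded trial results. The sketch below assumes independent trials with a constant success rate p for the first form; `pass_hat_k` and `all_trials_pass` are hypothetical helper names.

```bash
# Assumption: trials are independent and share the same success rate p.
pass_hat_k() {
  # usage: pass_hat_k P K  (P = per-trial success rate in [0,1], K = trials)
  awk -v p="$1" -v k="$2" 'BEGIN { printf "%.4f\n", p ^ k }'
}

# Empirical variant: succeed only if every recorded trial is PASS.
all_trials_pass() {
  # usage: all_trials_pass PASS PASS FAIL ...
  [ "$#" -gt 0 ] || return 1
  for result in "$@"; do
    [ "$result" = "PASS" ] || return 1
  done
  return 0
}

pass_hat_k 0.9 3   # prints 0.7290
if all_trials_pass PASS PASS PASS; then echo "pass^3: PASS"; else echo "pass^3: FAIL"; fi
```

Under the independence assumption, a 90% per-trial success rate yields only about 73% for pass^3, which is why pass^k is the stricter of the two metrics.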

Use this skill

Per request

Add a "skill" field with the skill’s ID to your chat completion request. It is applied server-side before your prompt is sent — no extra calls.

{
  "model": "gpt-4o-mini",
  "skill": "imp-6faf0aff-c923-4a48-9d86-053be4b78d2b",
  "messages": [{ "role": "user", "content": "…" }]
}

Always on — no field to send

Install the skill, enable it in your dashboard and (optionally) limit it to specific models. It then applies automatically to every matching request — with no "skill" field to send each time.

