eval-harness

Official

by Api.Airforce · Prepends a system prompt · AI & Agent Building · 000 uses · 202,700

Formal evaluation framework for Claude Code sessions that applies eval-driven development (EDD) principles

open-source · claude-code · ai-agent-building · affaan-m

What this skill does

When applied, it prepends a system prompt before your request is sent — no extra calls and no change to how you are billed beyond the added tokens.

---
name: eval-harness
description: Formal evaluation framework for Claude Code sessions that applies eval-driven development (EDD) principles
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob
---

# Eval Harness Skill

A formal evaluation framework for Claude Code sessions that applies eval-driven development (EDD) principles.

## When to Activate

- When setting up eval-driven development (EDD) for AI-assisted workflows
- When defining pass/fail criteria for Claude Code task completion
- When measuring agent reliability with pass@k metrics
- When building regression test suites for prompt or agent changes
- When benchmarking agent performance across model versions

## Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":
- Define the expected behavior BEFORE implementation
- Run evals continuously during development
- Track regressions with every change
- Use pass@k metrics to measure reliability
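
As a rough illustration of the last point, the sketch below treats pass@k as "at least one of k attempts passes", which is an assumed reading of the metric here. The `pass_at_k` helper is hypothetical and not part of the skill; it reruns any grader command up to k times and stops at the first success.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill): pass_at_k K CMD [ARGS...]
# Assumption: pass@k means "at least one of K attempts succeeds".
pass_at_k() {
  local k="$1" attempt=1
  shift
  while [ "$attempt" -le "$k" ]; do
    # The grader command signals success through its exit status.
    if "$@" >/dev/null 2>&1; then
      echo "PASS (attempt ${attempt}/${k})"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "FAIL (0/${k} attempts passed)"
  return 1
}
```

For example, `pass_at_k 3 npm test` would report a pass if any of three test runs exits with status 0.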

## Eval Types

### Capability Evals
Test whether Claude can do something it could not do before:
```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected Output: Description of the expected result
```
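
One way to score a filled-in template like this is to count its checkboxes. The sketch below assumes that a met criterion is marked `- [x]` and an unmet one stays `- [ ]`; that marking convention and the `criteria_summary` helper are illustrative assumptions, not something the skill defines.

```bash
#!/usr/bin/env bash
# Hypothetical helper: summarize checkbox criteria in an eval file.
# Assumption: "- [x]" marks a met criterion, "- [ ]" an unmet one.
criteria_summary() {
  local file="$1" met total
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  met=$(grep -c -E '^[[:space:]]*- \[[xX]\]' "$file" || true)
  total=$(grep -c -E '^[[:space:]]*- \[[ xX]\]' "$file" || true)
  echo "${met}/${total} criteria met"
  # Succeed only when at least one criterion exists and all are met.
  [ "$total" -gt 0 ] && [ "$met" -eq "$total" ]
}
```

Calling `criteria_summary my-eval.md` prints the tally and returns a non-zero status while any criterion is still unchecked, so it can gate a script.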

### Regression Evals
Make sure changes do not break existing functionality:
```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
  - existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```
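
The summary line of such a report can be derived rather than typed by hand. The sketch below is a minimal tally, assuming each test line has been filled in with exactly one of `PASS` or `FAIL`; the `regression_summary` helper is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: tally "- name: PASS" / "- name: FAIL" lines in a report.
# Assumption: each test line ends with exactly PASS or FAIL once filled in.
regression_summary() {
  local file="$1" passed total
  passed=$(grep -c -E '^[[:space:]]*- [^:]+: PASS$' "$file" || true)
  total=$(grep -c -E '^[[:space:]]*- [^:]+: (PASS|FAIL)$' "$file" || true)
  echo "Result: ${passed}/${total} passed"
  # A non-zero exit status signals at least one regression.
  [ "$passed" -eq "$total" ]
}
```

Unfilled template lines that still read `PASS/FAIL` match neither pattern, so they are left out of the count instead of being miscounted.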

## Grader Types

### 1. Code-Based Grader
Deterministic checks using code:
```bash
# Check whether the file contains the expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check whether the tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check whether the build succeeded …
```

Use this skill

Per request

Add a "skill" field with the skill’s ID to your chat completion request. It is applied server-side before your prompt is sent — no extra calls.

{
  "model": "gpt-4o-mini",
  "skill": "imp-59567b15-60f2-4e5a-9c41-2ae2031aeb0c",
  "messages": [{ "role": "user", "content": "…" }]
}
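
A request body like the one above could be assembled and sent from a shell script along the following lines. This is only a sketch: `build_payload` is a hypothetical helper, and `API_URL`, `API_KEY`, and the bearer-token header are placeholders and assumptions, since the endpoint and authentication scheme are not given here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build a chat completion body that carries a "skill" field.
# It escapes only backslashes and double quotes in the message content;
# newlines and other control characters are not handled by this sketch.
build_payload() {
  local model="$1" skill="$2" content
  content=$(printf '%s' "$3" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model":"%s","skill":"%s","messages":[{"role":"user","content":"%s"}]}\n' \
    "$model" "$skill" "$content"
}

# Sending it (left commented out; API_URL and API_KEY are placeholders, and the
# Authorization header format is an assumption to check against your provider):
# curl -sS "$API_URL" \
#   -H "Authorization: Bearer $API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$(build_payload "gpt-4o-mini" "imp-59567b15-60f2-4e5a-9c41-2ae2031aeb0c" "Hello")"
```

Keeping the body construction in one function makes it easy to add or drop the "skill" field per request without touching the transport code.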

Always on — no field to send

Install the skill, enable it in your dashboard and (optionally) limit it to specific models. It then applies automatically to every matching request — with no "skill" field to send each time.
