data-scraper-agent

Official

by Api.Airforce · Prepends a system prompt · Data & Analytics

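Build a fully automated, AI-powered data collection agent for any public source, such as job boards, price feeds, news, GitHub, or sports events. It scrapes on a schedule, enriches the data with a free LLM (Gemini Flash), stores results in Notion/Sheets/Supabase, and learns from user feedback. It runs entirely free on GitHub Actions. Use it when the user wants to automatically monitor, collect, or track any public data.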

open-source · claude-code · data-analytics · affaan-m

What this skill does

When applied, it prepends a system prompt before your request is sent — no extra calls and no change to how you are billed beyond the added tokens.

---
name: data-scraper-agent
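description: Build a fully automated, AI-powered data collection agent for any public source, such as job boards, price feeds, news, GitHub, or sports events. Scrapes on a schedule, enriches data with a free LLM (Gemini Flash), stores results in Notion/Sheets/Supabase, and learns from user feedback. Runs entirely free on GitHub Actions. Use when the user wants to automatically monitor, collect, or track any public data.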
origin: community
---

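# Data Scraper Agent

Build a production-ready, AI-powered data collection agent for any public data source.
It runs on a schedule, enriches results with a free LLM, stores them in a database, and improves over time.

**Stack: Python · Gemini Flash (free) · GitHub Actions (free) · Notion / Sheets / Supabase**

## When to Activate

* The user wants to scrape or monitor any public website or API
* The user says "build a bot that checks...", "monitor X for me", or "collect data from..."
* The user wants to track jobs, prices, news, repositories, sports scores, events, or listings
* The user asks how to automate data collection without paying for hosting
* The user wants an agent that gets smarter over time based on their decisions

## Core Concepts

### Three-Layer Architecture

Every data scraper agent has three layers: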

```
COLLECT   →   ENRICH       →   STORE
   │             │                │
Scraper       AI (LLM)         Database
runs on       scores,          Notion /
schedule      summarises,      Sheets /
              classifies       Supabase
```
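
The three layers can be wired together as plain functions. The following is a minimal sketch, not the skill's actual implementation: `Item`, `run_pipeline`, and the `fake_*` stand-ins are hypothetical names chosen for illustration, and the real layers would call a scraper, an LLM, and a database instead.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Item:
    """One scraped record; the field names are illustrative assumptions."""
    title: str
    url: str
    enrichment: dict = field(default_factory=dict)


def run_pipeline(
    collect: Callable[[], Iterable[Item]],
    enrich: Callable[[list[Item]], list[Item]],
    store: Callable[[list[Item]], int],
) -> int:
    """Run COLLECT -> ENRICH -> STORE once and return the number of items stored."""
    items = list(collect())
    if not items:
        return 0  # nothing new, so skip the LLM call entirely
    return store(enrich(items))


# Stand-in layers so the sketch runs without network access or API keys.
def fake_collect() -> list[Item]:
    return [Item("Example post", "https://example.com/1")]


def fake_enrich(items: list[Item]) -> list[Item]:
    for item in items:
        item.enrichment = {"score": len(item.title)}  # placeholder for an LLM score
    return items


saved: list[Item] = []


def fake_store(items: list[Item]) -> int:
    saved.extend(items)
    return len(items)


stored_count = run_pipeline(fake_collect, fake_enrich, fake_store)
```

Keeping each layer behind a function boundary means a scraper, model, or database can be swapped without touching the other two.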

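### The Free Stack

| Layer | Tool | Why |
|---|---|---|
| **Scraping** | `requests` + `BeautifulSoup` | No cost; covers 80% of public sites |
| **JS-rendered sites** | `playwright` (free) | Use when plain HTML scraping fails |
| **AI enrichment** | Gemini Flash via REST API | 500 requests/day, 1M tokens/day, free |
| **Storage** | Notion API | Free tier; great UI for review |
| **Scheduling** | GitHub Actions cron | Free for public repos |
| **Learning** | JSON feedback file in the repo | Zero infrastructure; persisted in git |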
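
The learning layer in the table can be as small as one JSON file that the scheduled run reads and rewrites. The sketch below is an assumption about how such a file might look: the `liked`/`disliked` keys, the `feedback.json` name, and both helper functions are hypothetical, since the source does not specify a format.

```python
import json
import tempfile
from pathlib import Path


def load_feedback(path: Path) -> dict:
    """Return saved feedback, or an empty structure on the first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"liked": [], "disliked": []}  # assumed schema, not from the source


def record_decision(path: Path, url: str, liked: bool) -> dict:
    """Append one user decision and write the file back so it can be committed."""
    feedback = load_feedback(path)
    bucket = "liked" if liked else "disliked"
    if url not in feedback[bucket]:
        feedback[bucket].append(url)
    path.write_text(json.dumps(feedback, indent=2, sort_keys=True), encoding="utf-8")
    return feedback


# Demo in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    feedback_path = Path(tmp) / "feedback.json"
    record_decision(feedback_path, "https://example.com/1", liked=True)
    record_decision(feedback_path, "https://example.com/2", liked=False)
    final_feedback = load_feedback(feedback_path)
```

Writing with `sort_keys=True` and a fixed indent keeps git diffs of the feedback file small and readable.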

### AI Model Fallback Chain

Build the agent to fall back automatically across Gemini models when a quota is exhausted:

```
gemini-2.0-flash-lite (30 RPM) →
gemini-2.0-flash (15 RPM) →
gemini-2.5-flash (10 RPM) →
gemini-flash-lite-latest (fallback)
```
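
One way to implement the chain is a loop that moves to the next model whenever the current one reports an exhausted quota. This is a sketch under assumptions: `QuotaExhausted`, `call_with_fallback`, and the injected `call_model` function are hypothetical, and a real client would need to map the API's rate-limit response onto that exception.

```python
from typing import Callable, Sequence

# Model names come from the chain above, in the order they are tried.
MODEL_CHAIN = (
    "gemini-2.0-flash-lite",
    "gemini-2.0-flash",
    "gemini-2.5-flash",
    "gemini-flash-lite-latest",
)


class QuotaExhausted(Exception):
    """Hypothetical error the caller-supplied function raises on a quota response."""


def call_with_fallback(
    prompt: str,
    call_model: Callable[[str, str], str],
    models: Sequence[str] = MODEL_CHAIN,
) -> tuple[str, str]:
    """Try each model in order and return (model_used, response_text)."""
    for model in models:
        try:
            return model, call_model(model, prompt)
        except QuotaExhausted:
            continue  # this model is out of quota, so move to the next one
    raise RuntimeError("All models in the fallback chain are out of quota")


# Demo with a fake client in which the first two models are exhausted.
def fake_call(model: str, prompt: str) -> str:
    if model in {"gemini-2.0-flash-lite", "gemini-2.0-flash"}:
        raise QuotaExhausted(model)
    return f"{model} answered: {prompt}"


used_model, reply = call_with_fallback("hello", fake_call)
```

Passing the client in as `call_model` keeps the fallback logic testable without an API key.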

### Batch API Calls for Efficiency

Never call the LLM once per item. Always batch:

```python
# BAD: 33 API calls for 33 items
for item in items:
    result = call_ai(item)  # 33 calls → hits rate limit

# GOOD: 7 API calls for 33 items (batch size 5)
for batch in chunks(items, size=5):
    results = call_ai(batch)  # 7 calls → stays within free tier
```
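
The snippet above leaves `chunks` and `call_ai` undefined. A minimal runnable sketch of the batching helper follows; the `call_ai` here is a hypothetical stand-in that returns one result per item, where a real version would send the whole batch in a single LLM request.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunks(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def call_ai(batch: Sequence[str]) -> list[str]:
    """Hypothetical stand-in for one LLM request covering the whole batch."""
    return [f"scored:{item}" for item in batch]


items = [f"item-{i}" for i in range(33)]
calls = 0
results: list[str] = []
for batch in chunks(items, size=5):
    results.extend(call_ai(batch))
    calls += 1  # 33 items in batches of 5 gives 7 calls
```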

***

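## Workflow

### Step 1: Understand the Goal

Ask the user:

1. **What to collect:** "What is the data source? A URL, an API, an RSS feed, or a public endpoint?"
2. **What to extract:** "Which fields matter? Title, price, URL, date, score?"
3. **How to store:** "Where should results go? Notion, Google Sheets, Supabase, or a local file?"
4. **How to enrich:** "Do you want the AI to score, summarise, classify, or match each item?"
5. **Frequency:** "How often should it run? Hourly, daily, weekly?"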
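
The five answers above can be captured in a small config object that the rest of the agent reads. This is a sketch, not part of the skill itself: `AgentConfig`, its field names, and the allowed values are assumptions drawn from the options named in the questions.

```python
from dataclasses import dataclass, field

# Allowed values are assumptions based on the options listed in the questions.
ALLOWED_STORES = {"notion", "sheets", "supabase", "local"}
ALLOWED_FREQUENCIES = {"hourly", "daily", "weekly"}


@dataclass
class AgentConfig:
    source_url: str                                       # 1. what to collect
    extract_fields: list[str]                             # 2. what to extract
    store: str = "notion"                                 # 3. where to store
    enrichments: list[str] = field(default_factory=list)  # 4. how to enrich
    frequency: str = "daily"                              # 5. how often to run

    def __post_init__(self) -> None:
        # Fail early on typos instead of at the first scheduled run.
        if self.store not in ALLOWED_STORES:
            raise ValueError(f"unknown store: {self.store!r}")
        if self.frequency not in ALLOWED_FREQUENCIES:
            raise ValueError(f"unknown frequency: {self.frequency!r}")
        if not self.extract_fields:
            raise ValueError("at least one field to extract is required")


config = AgentConfig(
    source_url="https://example.com/jobs",  # placeholder URL
    extract_fields=["title", "url", "date"],
    enrichments=["score", "summarise"],
)
```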

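Common examples to suggest:

* Job boards → score relevance against a resume
* Product prices → alert on price drops
* GitHub repos → summarise new releases
* News feeds → classify by topic + sentiment
* Sports results → extract stats into a tracker
* Event calendars →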

Use this skill

Per request

Add a "skill" field with the skill’s ID to your chat completion request. It is applied server-side before your prompt is sent — no extra calls.

{
  "model": "gpt-4o-mini",
  "skill": "imp-1e99a9fb-59f2-4005-b642-902558897153",
  "messages": [{ "role": "user", "content": "…" }]
}

Always on — no field to send

Install the skill, enable it in your dashboard and (optionally) limit it to specific models. It then applies automatically to every matching request — with no "skill" field to send each time.

