安装方式
手动下载安装
下载 ZIP 后解压到技能目录即可安装。若在桌面客户端 WebView中直接下载出现异常,本站会改为提示页 + 原始链接,请按页内说明操作。
下载 ZIP (shub-playwright-scraper-skill-v1.2.0.zip)触发指令
/playwright-scraper-skill
跨平台安装指引
该技能声明兼容以下 1 个平台,将 ZIP 解压到对应目录即可被识别。
unzip shub-playwright-scraper-skill-v1.2.0.zip -d ~/.claude/skills/
mkdir -p 创建;启用 Skill 后请重启对应 Agent 让配置生效。
使用指南
Playwright 网页抓取
围绕 Playwright 网页抓取:用 Playwright 做抓取与结构化抽取;遵守站点 robots 与服务条款。 无需在每次任务前把零散英文说明手工拼进上下文,也 减少 与客户端默认行为脱节的试错;具体命令、钩子与 JSON 参数仍以 ZIP 包内 SKILL.md 为权威。下文结构与站内 MCP CLI 类专题稿相同:何时用、前置、流程、速查与故障。
何时使用
- 用 Playwright 做抓取与结构化抽取
- 遵守站点 robots 与服务条款
- 已获取本技能 ZIP,并准备在 Claude Code / OpenClaw 中按 SKILL.md 挂载。
- 希望用中文专题稿快速判断「该不该启用」,再深入英文 SKILL 查参数与边界。
- 需要与团队对齐同一套触发方式、目录约定或回调格式时。
前置条件
- 通用:可运行 Claude Code 或文档要求的客户端;有可读写的项目工作区(或 SKILL.md 指定的沙箱目录)。
- 权威细节:API Key / OAuth、钩子路径、环境变量以 ZIP 内 SKILL.md 为准。
典型流程
- 从 ClawHub / 站内分发获取技能 ZIP,校验版本与校验和(若提供)。
- 阅读 SKILL.md 的安装段落:目录落点、客户端类型(Claude Code / OpenClaw / 脚本)。
- 用文档中的最小示例完成第一次调用(单文件修改、单次查询或单次委派)。
- 确认工作目录、权限边界与输出路径后,再处理多文件或长耗时任务。
- 需要回调 / Webhook / 通知时,按 SKILL.md 配置端点并在测试环境先验通。
与 ZIP / SKILL.md 的关系
站内专题稿与 MCP CLI 类 oss 稿同样:概括何时用、怎么接、怎么排错;命令模板、钩子名、JSON 字段、版本矩阵一律以 ZIP 内 SKILL.md 与 ClawHub 上游为准。
命令示例(摘自包内 SKILL.md)
以下为从上游 SKILL.md(或入库正文)自动抽取的终端/脚本片段;路径、环境变量与参数以当前 ZIP 与官方说明为准。
ClawHub slug:playwright-scraper-skill(安装命令以 SKILL.md / claw CLI 为准)。
cd playwright-scraper-skill
npm install
npx playwright install chromium
# Invoke directly in OpenClaw
Hey, fetch me the content from https://example.com
node scripts/playwright-simple.js "https://example.com"
node scripts/playwright-stealth.js "https://m.discuss.com.hk/#hot"
# Install deep-scraper skill
npx clawhub install deep-scraper
# Use it
cd skills/deep-scraper
node assets/youtube_handler.js "https://www.youtube.com/watch?v=VIDEO_ID"
# Set screenshot path
SCREENSHOT_PATH=/path/to/screenshot.png node scripts/playwright-stealth.js URL
# Set wait time (milliseconds)
WAIT_TIME=10000 node scripts/playwright-simple.js URL
# Enable headful mode (show browser)
HEADLESS=false node scripts/playwright-stealth.js URL
# Save HTML
SAVE_HTML=true node scripts/playwright-stealth.js URL
# Custom User-Agent
USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js URL
站内入库时的触发命令(完整语义见 ZIP):
# 使用本技能时可在对话中引用或执行上述指令;完整参数与示例见下载包内 SKILL.md。
/playwright-scraper-skill
最佳实践
- 先 SKILL.md 再猜参数;站内专题稿不替代 schema 与必填字段说明。
- 委派任务时写清验收标准(命令、文件路径、测试命令),减少来回追问。
- 长任务用文档推荐的回调 / 日志落盘代替高频轮询,省 Token 也省机器负载。
- 多技能同时启用时,注意钩子加载顺序与重复工具调用(以 SKILL.md 冲突说明为准)。
调试与排错
- 打开 stderr 与客户端日志;PTY/tmux 场景同时看面板最后几十行输出。
- 参数错误时对照 SKILL.md 中的 JSON/CLI 示例(引号、转义、工作目录)。
- 网络类失败:查代理、防火墙、MCP 传输方式(stdio / HTTP / SSE)。
速查
| 动作 | 说明 |
|------|------|
| 获取技能包 | ClawHub / 站内 ZIP,核对版本 |
| 权威步骤 | 优先阅读 ZIP 内 SKILL.md |
| 首次试跑 | 使用 SKILL.md 最小示例 |
| 验收 | 对照路径、测试命令或回调负载 |
常见故障
- 无输出或立即退出 → 工作目录错误、依赖未装、或 Claude Code 未登录;按 SKILL.md 自检清单执行。
- 权限被拒绝 → 检查沙箱路径、
--permission-mode与工具白名单。 - 与简介不符 → 以英文 SKILL 与上游仓库为准,站内稿仅作结构化导读。
# Playwright Scraper Skill
A Playwright-based web scraping OpenClaw Skill with anti-bot protection. Choose the best approach based on the target website's anti-bot level.
---
## 🎯 Use Case Matrix
| Target Website | Anti-Bot Level | Recommended Method | Script |
|---------------|----------------|-------------------|--------|
| **Regular Sites** | Low | web_fetch tool | N/A (built-in) |
| **Dynamic Sites** | Medium | Playwright Simple | `scripts/playwright-simple.js` |
| **Cloudflare Protected** | High | **Playwright Stealth** ⭐ | `scripts/playwright-stealth.js` |
| **YouTube** | Special | deep-scraper | Install separately |
| **Reddit** | Special | reddit-scraper | Install separately |
---
## 📦 Installation
```bash
cd playwright-scraper-skill
npm install
npx playwright install chromium
```
---
## 🚀 Quick Start
### 1️⃣ Simple Sites (No Anti-Bot)
Use OpenClaw's built-in `web_fetch` tool:
```bash
# Invoke directly in OpenClaw
Hey, fetch me the content from https://example.com
```
---
### 2️⃣ Dynamic Sites (Requires JavaScript)
Use **Playwright Simple**:
```bash
node scripts/playwright-simple.js "https://example.com"
```
**Example output:**
```json
{
"url": "https://example.com",
"title": "Example Domain",
"content": "...",
"elapsedSeconds": "3.45"
}
```
---
### 3️⃣ Anti-Bot Protected Sites (Cloudflare etc.)
Use **Playwright Stealth**:
```bash
node scripts/playwright-stealth.js "https://m.discuss.com.hk/#hot"
```
**Features:**
- Hide automation markers (`navigator.webdriver = false`)
- Realistic User-Agent (iPhone, Android)
- Random delays to mimic human behavior
- Screenshot and HTML saving support
---
### 4️⃣ YouTube Video Transcripts
Use **deep-scraper** (install separately):
```bash
# Install deep-scraper skill
npx clawhub install deep-scraper
# Use it
cd skills/deep-scraper
node assets/youtube_handler.js "https://www.youtube.com/watch?v=VIDEO_ID"
```
---
## 📖 Script Descriptions
### `scripts/playwright-simple.js`
- **Use Case:** Regular dynamic websites
- **Speed:** Fast (3-5 seconds)
- **Anti-Bot:** None
- **Output:** JSON (title, content, URL)
### `scripts/playwright-stealth.js` ⭐
- **Use Case:** Sites with Cloudflare or anti-bot protection
- **Speed:** Medium (5-20 seconds)
- **Anti-Bot:** Medium-High (hides automation, realistic UA)
- **Output:** JSON + Screenshot + HTML file
- **Verified:** 100% success on Discuss.com.hk
---
## 🎓 Best Practices
### 1. Try web_fetch First
If the site doesn't have dynamic loading, use OpenClaw's `web_fetch` tool—it's fastest.
### 2. Need JavaScript? Use Playwright Simple
If you need to wait for JavaScript rendering, use `playwright-simple.js`.
### 3. Getting Blocked? Use Stealth
If you encounter 403 or Cloudflare challenges, use `playwright-stealth.js`.
### 4. Special Sites Need Specialized Skills
- YouTube → deep-scraper
- Reddit → reddit-scraper
- Twitter → bird skill
---
## 🔧 Customization
All scripts support environment variables:
```bash
# Set screenshot path
SCREENSHOT_PATH=/path/to/screenshot.png node scripts/playwright-stealth.js URL
# Set wait time (milliseconds)
WAIT_TIME=10000 node scripts/playwright-simple.js URL
# Enable headful mode (show browser)
HEADLESS=false node scripts/playwright-stealth.js URL
# Save HTML
SAVE_HTML=true node scripts/playwright-stealth.js URL
# Custom User-Agent
USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js URL
```
---
## 📊 Performance Comparison
| Method | Speed | Anti-Bot | Success Rate (Discuss.com.hk) |
|--------|-------|----------|-------------------------------|
| web_fetch | ⚡ Fastest | ❌ None | 0% |
| Playwright Simple | 🚀 Fast | ⚠️ Low | 20% |
| **Playwright Stealth** | ⏱️ Medium | ✅ Medium | **100%** ✅ |
| Puppeteer Stealth | ⏱️ Medium | ✅ Medium-High | ~80% |
| Crawlee (deep-scraper) | 🐢 Slow | ❌ Detected | 0% |
| Chaser (Rust) | ⏱️ Medium | ❌ Detected | 0% |
---
## 🛡️ Anti-Bot Techniques Summary
Lessons learned from our testing:
### ✅ Effective Anti-Bot Measures
1. **Hide `navigator.webdriver`** — Essential
2. **Realistic User-Agent** — Use real devices (iPhone, Android)
3. **Mimic Human Behavior** — Random delays, scrolling
4. **Avoid Framework Signatures** — Crawlee, Selenium are easily detected
5. **Use `addInitScript` (Playwright)** — Inject before page load
### ❌ Ineffective Anti-Bot Measures
1. **Only changing User-Agent** — Not enough
2. **Using high-level frameworks (Crawlee)** — More easily detected
3. **Docker isolation** — Doesn't help with Cloudflare
---
## 🔍 Troubleshooting
### Issue: 403 Forbidden
**Solution:** Use `playwright-stealth.js`
### Issue: Cloudflare Challenge Page
**Solution:**
1. Increase wait time (10-15 seconds)
2. Try `headless: false` (headful mode sometimes has higher success rate)
3. Consider using proxy IPs
### Issue: Blank Page
**Solution:**
1. Increase `waitForTimeout`
2. Use `waitUntil: 'networkidle'` or `'domcontentloaded'`
3. Check if login is required
---
## 📝 Memory & Experience
### 2026-02-07 Discuss.com.hk Test Conclusions
- ✅ **Pure Playwright + Stealth** succeeded (5s, 200 OK)
- ❌ Crawlee (deep-scraper) failed (403)
- ❌ Chaser (Rust) failed (Cloudflare)
- ❌ Puppeteer standard failed (403)
**Best Solution:** Pure Playwright + anti-bot techniques (framework-independent)
---
## 🚧 Future Improvements
- [ ] Add proxy IP rotation
- [ ] Implement cookie management (maintain login state)
- [ ] Add CAPTCHA handling (2captcha / Anti-Captcha)
- [ ] Batch scraping (parallel URLs)
- [ ] Integration with OpenClaw's `browser` tool
---
## 📚 References
- [Playwright Official Docs](https://playwright.dev/)
- [puppeteer-extra-plugin-stealth](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth)
- [deep-scraper skill](https://clawhub.com/opsun/deep-scraper)