Co-Pilot / Assistive
Updated a month ago

article-extractor

michalparkola
0.2k
michalparkola/tapestry-skills-for-claude-code/article-extractor
72
Agent Score

💡 Summary

Extracts clean article body text from a URL, removing ads and clutter, and saves it as a readable text file.

🎯 Who It's For

  • Researchers who need to save articles for analysis
  • Students organizing reading material
  • Content curators building content archives
  • Developers who need offline content

🤖 AI Snark: This is just a fancy wrapper around existing CLI tools, functionally little different from a simple shell script.

Security Analysis: Medium Risk

This skill executes shell commands with user-supplied URLs, creating a command-injection risk if the URL is not sanitized. It also downloads and processes arbitrary HTML, which may be malicious. Mitigation: strictly validate and sanitize URLs before passing them to any command.
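As a minimal sketch of that mitigation (the helper name `validate_url` is ours for illustration, not part of the skill), an allow-list check before a URL ever reaches a command might look like:

```shell
# Illustrative URL allow-list check (validate_url is a hypothetical
# helper, not defined by the skill itself).
# Accepts only http(s) URLs containing no whitespace or shell
# metacharacters that could enable injection if quoting ever slips.
validate_url() {
    printf '%s' "$1" | grep -Eq '^https?://[^ ;&|`$()<>]+$'
}

# Always quote the variable when it reaches a command:
# validate_url "$URL" && reader "$URL" > article.txt
```

This is defense in depth: consistent quoting (`"$URL"`) is the primary protection, and the allow-list rejects inputs that are not plausible URLs at all.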


---
name: article-extractor
description: Extract clean article content from URLs (blog posts, articles, tutorials) and save as readable text. Use when user wants to download, extract, or save an article/blog post from a URL without ads, navigation, or clutter.
allowed-tools: Bash,Write
---

Article Extractor

This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter. Saves clean, readable text.

When to Use This Skill

Activate when the user:

  • Provides an article/blog URL and wants the text content
  • Asks to "download this article"
  • Wants to "extract the content from [URL]"
  • Asks to "save this blog post as text"
  • Needs clean article text without distractions

How It Works

Priority Order:

  1. Check if tools are installed (reader or trafilatura)
  2. Download and extract article using best available tool
  3. Clean up the content (remove extra whitespace, format properly)
  4. Save to file with article title as filename
  5. Confirm location and show preview

Installation Check

Check for article extraction tools in this order:

Option 1: reader (Recommended - Mozilla's Readability)

command -v reader

If not installed:

npm install -g @mozilla/readability-cli
# or: npm install -g reader-cli

Option 2: trafilatura (Python-based, very good)

command -v trafilatura

If not installed:

pip3 install trafilatura

Option 3: Fallback (curl + simple parsing)

If no tools available, use basic curl + text extraction (less reliable but works)

Extraction Methods

Method 1: Using reader (Best for most articles)

# Extract article
reader "URL" > article.txt

Pros:

  • Based on Mozilla's Readability algorithm
  • Excellent at removing clutter
  • Preserves article structure

Method 2: Using trafilatura (Best for blogs/news)

# Extract article
trafilatura --URL "URL" --output-format txt > article.txt

# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt

Pros:

  • Very accurate extraction
  • Good with various site structures
  • Handles multiple languages

Options:

  • --no-comments: Skip comment sections
  • --no-tables: Skip data tables
  • --precision: Favor precision over recall
  • --recall: Extract more content (may include some noise)

Method 3: Fallback (curl + basic parsing)

# Download and extract basic content
curl -s "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
        self.current_tag = None

    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}:
                self.in_content = True
            self.current_tag = tag

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt

Note: This is less reliable but works without dependencies.

Getting Article Title

Extract title for filename:

Using reader:

# reader outputs markdown with title at top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')

Using trafilatura:

# Get metadata including title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")

Using curl (fallback):

# Note: requires GNU grep for -P (PCRE); not available in BSD/macOS grep
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')

Filename Creation

Clean title for filesystem:

# Get title
TITLE="Article Title from Website"

# Clean for filesystem (remove special chars, limit length)
# Note: tr needs -d to delete characters; an empty replacement set is an error
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr -d '?"<>' | tr '|' '-' | cut -c 1-100 | sed 's/ *$//')

# Add extension
FILENAME="${FILENAME}.txt"

Complete Workflow

ARTICLE_URL="https://example.com/article"

# Check for tools
if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt
        # Get title (first line after # in markdown)
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;
    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")
        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;
    fallback)
        # Get title
        TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}   # Remove site name
        TITLE=${TITLE%% | *}   # Remove site name (alternate)
        # Get content (basic extraction)
        curl -s "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}

    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main'}:
                self.in_content = True
            if tag in {'h1', 'h2', 'h3'}:
                self.content.append('\n')

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
        ;;
esac

# Clean filename (tr -d deletes unsafe characters; an empty tr set is invalid)
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr -d '?"<>' | tr '|' '-' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
FILENAME="${FILENAME}.txt"

# Move to final filename
mv temp_article.txt "$FILENAME"

# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"

Error Handling

Common Issues

1. Tool not installed

  • Try alternate tool (reader → trafilatura → fallback)
  • Offer to install: "Install reader with: npm install -g reader-cli"

2. Paywall or login required

  • Extraction tools may fail
  • Inform user: "This article requires authentication. Cannot extract."

3. Invalid URL

  • Check URL format
  • Try with and without redirects

4. No content extracted

  • Site may use heavy JavaScript
  • Try fallback method
  • Inform user if extraction fails

5. Special characters in title

  • Clean title for filesystem
  • Remove: /, :, ?, ", <, >, |
  • Replace with - or remove
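For issue 3 above, a small normalization pass can repair the most common format problems before extraction is even attempted. A sketch, assuming a helper of our own naming (`normalize_url` is illustrative, not defined by the skill):

```shell
# Trim stray whitespace and supply a default scheme when one is missing
# (normalize_url is a hypothetical helper used for illustration).
normalize_url() {
    local url="$1"
    # Strip leading/trailing whitespace
    url="$(printf '%s' "$url" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')"
    # Prepend https:// if no scheme is present
    case "$url" in
        http://*|https://*) ;;
        *) url="https://$url" ;;
    esac
    printf '%s\n' "$url"
}
```

Usage: `URL=$(normalize_url "$RAW_URL")` before handing the URL to any extraction tool.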

Output Format

Saved File Contains:

  • Article title (if available)
  • Author (if available from tool)
  • Main article text
  • Section headings
  • No navigation, ads, or clutter

What Gets Removed:

  • Navigation menus
  • Ads and promotional content
  • Newsletter signup forms
  • Related articles sidebars
  • Comment sections (optional)
  • Social media buttons
  • Cookie notices

Tips for Best Results

1. Use reader for most articles

  • Best all-around tool
  • Based on Firefox Reader View
  • Works on most news sites and blogs

2. Use trafilatura for:

  • Academic articles
  • News sites
  • Blogs with complex layouts
  • Non-English content

3. Fallback method limitations:

  • May include some noise
  • Less accurate paragraph detection
  • Better than nothing for simple sites

4. Check extraction quality:

  • Always show preview to user
  • Ask if it looks correct
  • Offer to try different tool if needed

Example Usage

Simple extraction:

# User: "Extract https://example.com/article"
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"

With error handling:

if ! reader "$URL" > temp.txt 2>/dev/null; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura."
        exit 1
    fi
fi

Best Practices

  • ✅ Always show preview after extraction (first 10 lines)
  • ✅ Verify extraction succeeded before saving
  • ✅ Clean filename for filesystem compatibility
  • ✅ Try fallback method if primary fails
  • ✅ Inform user which tool was used
  • ✅ Keep filename length reasonable (< 100 chars)
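The "verify extraction succeeded" practice can be sketched as a small guard. The helper name `check_extraction` and the 50-word threshold are our illustrative choices, not part of the skill:

```shell
# Fail when output is missing, empty, or suspiciously short
# (check_extraction and the 50-word cutoff are illustrative assumptions).
check_extraction() {
    local file="$1"
    [ -s "$file" ] || return 1        # missing or empty file
    local words
    words=$(wc -w < "$file")
    [ "$words" -ge 50 ]               # too short to be a real article
}

# Usage: check_extraction temp_article.txt || echo "Extraction looks incomplete"
```

A word count is a crude but cheap signal; heavy-JavaScript pages often produce a file containing only a cookie notice or a "please enable JS" message, which this catches.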

After Extraction

Display to user:

  1. "✓ Extracted: [Article Title]"
  2. "✓ Saved to: [filename]"
  3. Show preview (first 10-15 lines)
  4. File size and location
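The four display steps above could be wrapped in one helper (a sketch; `report_result` is our name for it, not something the skill defines):

```shell
# Print the post-extraction summary described above
# (report_result is a hypothetical helper for illustration).
report_result() {
    local title="$1" file="$2"
    echo "✓ Extracted: $title"
    echo "✓ Saved to: $file ($(wc -c < "$file") bytes)"
    echo ""
    echo "Preview (first 10 lines):"
    head -n 10 "$file"
}
```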

Ask if needed:

  • "Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
  • "Should I extract another article?"
Five-Dimension Analysis

Clarity: 8/10
Novelty: 3/10
Practicality: 9/10
Completeness: 9/10
Maintainability: 7/10
Pros and Cons

Pros

  • Highly practical for a common task
  • Clear documentation with detailed steps
  • Multiple fallback methods make it robust

Cons

  • No novel implementation; it only wraps existing tools
  • Executing arbitrary commands carries security risk
  • Maintainability depends on external CLI tools


Disclaimer: This content is sourced from an open-source GitHub project and is presented for display and scoring analysis only.

Copyright belongs to the original author, michalparkola.