Scrap MCP
An MCP (Model Context Protocol) server that can scrape web pages and extract content using CSS selectors. Built with deno-dom for fast HTML parsing.
Installation
npx scrap-mcp
Web Scraper MCP Server
Why
Most LLM clients already have some HTTP fetching capabilities, but fetching a page directly often returns a lot of unnecessary content. This not only confuses the LLM, but also quickly fills up the context window.
That's where this MCP comes in: it enables targeted scraping using CSS selectors, so you only extract the content you actually need.
Example from Zed
See ZedExample.md for a real-world usage example.
Features
- Fetch any publicly accessible web page by URL
- Parse HTML content using the fast deno-dom library
- Extract text content using standard CSS selectors
- Support for complex selectors (classes, IDs, attributes, pseudo-selectors)
- Built-in error handling for network issues and parsing failures
- Safe execution with minimal required permissions
Prerequisites
- Deno installed on your system
- Network access for fetching web pages
Running the Server
deno run --allow-net jsr:@sigma/scrap-mcp
You can also run this with Bun and Node.js using bunx and npx respectively:
bunx rjsr @sigma/scrap-mcp
npx rjsr @sigma/scrap-mcp
MCP Tool Reference
scrape_page
The main tool for scraping web pages and extracting content.
Parameters:
- url (string, required): The URL of the page to scrape
- query_selector (string, required): CSS selector to query elements
Return Format:
Found X elements matching selector "SELECTOR" on URL:
Element 1: TEXT_CONTENT
Element 2: TEXT_CONTENT
...
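The return format above can be sketched as a small formatting function. This is a minimal illustration, not the server's actual code: the helper name `formatScrapeResult` is hypothetical, and the exact whitespace and line breaks the server emits are an assumption.

```typescript
// Hypothetical helper: renders extracted text in the documented return format.
// The real server may differ in whitespace or in how it trims text content.
function formatScrapeResult(url: string, selector: string, texts: string[]): string {
  const header = `Found ${texts.length} elements matching selector "${selector}" on ${url}:`;
  // Elements are numbered starting at 1, matching "Element 1", "Element 2", ...
  const lines = texts.map((text, i) => `Element ${i + 1}: ${text}`);
  return [header, ...lines].join("\n");
}

const sampleOutput = formatScrapeResult("https://example.com", "h1", ["Example Domain"]);
console.log(sampleOutput);
```

A client that needs the individual elements back can split the text on newlines and strip the `Element N: ` prefix, provided the extracted text itself contains no line breaks.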
Usage Examples
Basic Selectors
- Extract all headings:
  { "url": "https://example.com", "query_selector": "h1, h2, h3" }
- Extract all paragraphs:
  { "url": "https://example.com", "query_selector": "p" }
- Extract content from specific classes:
  { "url": "https://news.ycombinator.com", "query_selector": ".titleline > a" }
- Extract all links:
  { "url": "https://example.com", "query_selector": "a" }
Advanced Selectors
- Extract navigation items:
  { "url": "https://deno.land", "query_selector": "nav a" }
- Extract elements with specific attributes:
  { "url": "https://example.com", "query_selector": "a[href^='https://']" }
- Extract form inputs:
  { "url": "https://example.com", "query_selector": "input[type='text'], input[type='email']" }
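The JSON objects above are the tool arguments. Most MCP clients build the surrounding request for you, but for reference, here is a sketch of how such arguments are wrapped in a JSON-RPC 2.0 `tools/call` request as described by the MCP specification. The helper name `buildScrapeRequest` and the type names are hypothetical.

```typescript
// Arguments accepted by the scrape_page tool, as documented above.
interface ScrapePageArgs {
  url: string;
  query_selector: string;
}

// Shape of an MCP "tools/call" request (JSON-RPC 2.0).
interface ScrapeToolCall {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: ScrapePageArgs };
}

// Hypothetical helper: wraps the tool arguments in a tools/call request.
function buildScrapeRequest(id: number, args: ScrapePageArgs): ScrapeToolCall {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "scrape_page", arguments: args },
  };
}

const scrapeRequest = buildScrapeRequest(1, {
  url: "https://news.ycombinator.com",
  query_selector: ".titleline > a",
});
console.log(JSON.stringify(scrapeRequest, null, 2));
```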
CSS Selector Reference
Basic Selectors
- h1 - All H1 headings
- .className - All elements with class "className"
- #elementId - Element with ID "elementId"
- * - All elements
Combinators
- div p - All paragraphs inside div elements
- div > p - Direct paragraph children of div elements
- h1 + p - Paragraphs immediately following H1 elements
- h1 ~ p - All paragraphs that are siblings after H1 elements
Attribute Selectors
- [href] - All elements with href attribute
- a[title] - All links with title attribute
- a[href^="https://"] - Links starting with "https://"
- a[href$=".pdf"] - Links ending with ".pdf"
- a[href*="github"] - Links containing "github"
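To make the three substring operators concrete, the sketch below models their matching rules in plain TypeScript. It is an illustration of the standard CSS semantics, not code from this server or from deno-dom; the function name `attributeMatches` is hypothetical.

```typescript
type AttrOperator = "^=" | "$=" | "*=";

// Hypothetical model of CSS substring attribute matching.
// `value` is the raw attribute text as written in the HTML (null if absent).
function attributeMatches(value: string | null, op: AttrOperator, expected: string): boolean {
  // A missing attribute never matches; an empty expected string matches nothing.
  if (value === null || expected === "") return false;
  if (op === "^=") return value.startsWith(expected); // prefix match
  if (op === "$=") return value.endsWith(expected);   // suffix match
  return value.includes(expected);                    // "*=": substring match
}

console.log(attributeMatches("https://github.com/denoland", "^=", "https://")); // true
console.log(attributeMatches("/files/report.pdf", "$=", ".pdf"));               // true
console.log(attributeMatches("/about", "^=", "https://"));                      // false
```

Matching runs against the attribute value as written in the markup, so `a[href^="https://"]` skips relative links such as `/about` even though a browser would resolve them to an absolute URL.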
Pseudo-selectors
- li:first-child - First list item
- li:last-child - Last list item
- li:nth-child(2n) - Even-numbered list items
- p:not(.special) - Paragraphs without "special" class
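The `:nth-child(An+B)` notation selects elements whose 1-based position among their siblings equals `A*n + B` for some integer `n >= 0`. The sketch below models that rule directly; the function name `nthChildMatches` is hypothetical and the code is an illustration of the standard semantics, not part of this server.

```typescript
// Hypothetical model of :nth-child(An+B).
// `position` is the element's 1-based index among its siblings.
function nthChildMatches(a: number, b: number, position: number): boolean {
  const diff = position - b;
  if (a === 0) return diff === 0; // plain :nth-child(B)
  // position matches when diff is a non-negative multiple of a
  return diff % a === 0 && diff / a >= 0;
}

const positions = [1, 2, 3, 4, 5, 6];
console.log(positions.filter((p) => nthChildMatches(2, 0, p)));  // 2n      -> [2, 4, 6]
console.log(positions.filter((p) => nthChildMatches(2, 1, p)));  // 2n+1    -> [1, 3, 5]
console.log(positions.filter((p) => nthChildMatches(-1, 3, p))); // -n+3    -> [1, 2, 3]
```

The keywords `even` and `odd` are shorthand for `2n` and `2n+1` respectively.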
Complex Examples
- .article-content p, .article-content h2 - Paragraphs and H2s in article content
- nav ul li a - Navigation links
- table tr:nth-child(odd) td - Cells in odd table rows
- form input[required] - Required form inputs
Dependencies
- @modelcontextprotocol/sdk@1.8.0 - MCP SDK for server implementation
- @b-fuze/deno-dom@^0.1.49 - Fast DOM parser for HTML content
- zod@3.24.2 - Runtime type validation and schema definition
Error Handling
The server provides comprehensive error handling for:
- Network Issues: Invalid URLs, connection timeouts, DNS failures
- HTTP Errors: 404 Not Found, 403 Forbidden, 500 Server Error, etc.
- Parsing Failures: Malformed HTML, encoding issues
- Selector Issues: Invalid CSS selectors, no matching elements
- Content Issues: Elements found but no text content available
All errors are returned as readable text messages through the MCP protocol.
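One way a tool handler can turn failures into readable text is to catch the exception and place its message in an MCP text content item. The sketch below assumes the standard MCP tool result shape (`content` plus an optional `isError` flag). Whether this server sets `isError`, and how it words its messages, is not stated here, so treat both as assumptions. The name `toTextResult` is hypothetical, and the function is synchronous only to keep the example short; a real handler would be async.

```typescript
// MCP tool result carrying plain text.
interface TextToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

// Hypothetical wrapper: runs a step and converts any thrown error into text.
function toTextResult(run: () => string): TextToolResult {
  try {
    return { content: [{ type: "text", text: run() }] };
  } catch (err) {
    // Non-Error throwables are stringified so the client always gets a message.
    const message = err instanceof Error ? err.message : String(err);
    return { content: [{ type: "text", text: `Error: ${message}` }], isError: true };
  }
}

const failedResult = toTextResult(() => {
  throw new Error("HTTP 404 Not Found");
});
console.log(failedResult.content[0].text); // Error: HTTP 404 Not Found
```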
Best Practices
- Respect robots.txt: Always check the target site's robots.txt file
- Add delays: Use reasonable delays between requests to avoid overwhelming servers
- User-Agent: The scraper uses Deno's default User-Agent
Security
Required Permissions
--allow-net- To fetch web pages from the internet
Security Features
- No arbitrary code execution: Only CSS selectors are accepted, no JavaScript
- Network sandboxing: Only outbound HTTP/HTTPS requests allowed
- Input validation: All inputs validated using Zod schemas
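The server's actual Zod schemas are not shown in this document. As a rough, dependency-free stand-in, the sketch below checks the same two parameters by hand. The rules it enforces (http/https only, non-empty selector) and the name `validateScrapeInput` are assumptions for illustration.

```typescript
interface ScrapeInput {
  url: string;
  query_selector: string;
}

// Hypothetical hand-written validation, standing in for the server's Zod schema.
function validateScrapeInput(input: unknown): ScrapeInput {
  if (typeof input !== "object" || input === null) {
    throw new Error("input must be an object");
  }
  const { url, query_selector } = input as Record<string, unknown>;
  if (typeof url !== "string") throw new Error("url must be a string");
  if (typeof query_selector !== "string" || query_selector.trim() === "") {
    throw new Error("query_selector must be a non-empty string");
  }
  const parsed = new URL(url); // throws a TypeError on malformed URLs
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error("url must use http or https");
  }
  return { url, query_selector };
}

console.log(validateScrapeInput({ url: "https://example.com", query_selector: "h1" }));
```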
Troubleshooting
"Permission denied" errors:
# Ensure all required permissions are granted
deno run --allow-net jsr:@sigma/scrap-mcp
"No elements found" with valid selector:
- The page might load content dynamically with JavaScript
- Try different selectors or inspect the actual HTML source
- Some sites block automated requests
License
MIT License - see LICENSE file for details
