AI Search and Retrieval for Web Crawlers
Bridge the gap between your web crawl and AI language models using Model Context Protocol (MCP). With mcp-server-webcrawl, your AI client filters and analyzes web content under your direction or autonomously, extracting insights from your crawls.
Support for WARC, wget, InterroBot, Katana, and SiteOne crawlers is available out of the gate. The server includes a full-text search interface with boolean support, plus resource filtering by type, HTTP status, and more. mcp-server-webcrawl gives the LLM a complete menu of tools for searching your web content.
mcp-server-webcrawl is free and open source. It requires Claude Desktop and Python (>=3.10), and is installed on the command line via pip:
pip install mcp-server-webcrawl
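To confirm the install, check that the console script is on your PATH. The --help flag is an assumption here (the CLI's --crawler/--datasrc arguments suggest standard argparse behavior); any usage output indicates success:

$ mcp-server-webcrawl --help  # assumed flag; should print usage if the install succeeded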
Main Features
- Claude Desktop ready
- Full-text search support
- Filter by type, status, and more
- Supports wget, WARC, and more
- Augment your LLM knowledge base
- ChatGPT support coming soon
MCP Configuration
{ "mcpServers": { "webcrawl": { "command": "mcp-server-webcrawl", "args": ["--crawler", "wget", "--datasrc", "/path/to/wget/archives/"] } } } # tested configurations (macOS Terminal/Windows WSL) # --adjust-extension for file extensions, e.g. *.html $ wget --mirror https://example.com $ wget --mirror https://example.com --adjust-extension
{ "mcpServers": { "webcrawl": { "command": "mcp-server-webcrawl", "args": ["--crawler", "warc", "--datasrc", "/path/to/warc/archives/"] } } } # tested configurations (macOS Terminal/Windows WSL) $ wget --warc-file=example --recursive https://example.com $ wget --warc-file=example --recursive --page-requisites https://example.com
{ "mcpServers": { "webcrawl": { "command": "mcp-server-webcrawl", "args": ["--crawler", "interrobot", "--datasrc", "[homedir]/Documents/InterroBot/interrobot.v2.db"] } } } # crawls executed in InterroBot (windowed) # Windows: replace [homedir] with /Users/... # macOS: path provided on InterroBot settings page
{ "mcpServers": { "webcrawl": { "command": "mcp-server-webcrawl", "args": ["--crawler", "katana", "--datasrc", "/path/to/katana/crawls/"] } } } # tested configurations (macOS Terminal/Powershell/WSL) # -store-response to save crawl contents # -store-response-dir allows for many site crawls in one dir $ katana -u https://example.com -store-response -store-response-dir crawls/
{ "mcpServers": { "webcrawl": { "command": "mcp-server-webcrawl", "args": ["--crawler", "siteone", "--datasrc", "/path/to/siteone/archives/"] } } } # crawls executed in SiteOne (windowed) # *Generate offline website* must be checked
From Claude Desktop's developer settings, find the MCP configuration and add your crawl. Open the configuration file in a text editor and modify the matching example above to reflect your datasrc path.
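If you would rather edit the file directly, Claude Desktop stores its MCP configuration at standard locations (these paths are Claude Desktop conventions, not specific to mcp-server-webcrawl):

# macOS
~/Library/Application Support/Claude/claude_desktop_config.json
# Windows
%APPDATA%\Claude\claude_desktop_config.json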
You can set up multiple mcp-server-webcrawl connections under mcpServers if you want to search more than one crawler or data source, as in the sketch below.
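A minimal sketch: the keys under mcpServers are arbitrary labels of your choosing; the webcrawl_wget and webcrawl_warc names and paths here are illustrative, not required:

{
  "mcpServers": {
    "webcrawl_wget": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "wget", "--datasrc", "/path/to/wget/archives/"]
    },
    "webcrawl_warc": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "warc", "--datasrc", "/path/to/warc/archives/"]
    }
  }
}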