AI Search and Retrieval for Web Crawlers

Bridge the gap between your web crawl and AI language models using the Model Context Protocol (MCP). With mcp-server-webcrawl, your AI client filters and analyzes web content, under your direction or autonomously, extracting insights from your crawled sites.

Support for WARC, wget, InterroBot, Katana, and SiteOne crawlers is available out of the gate. The server includes a full-text search interface with boolean support, plus resource filtering by type, HTTP status, and more. mcp-server-webcrawl provides the LLM with a complete menu for searching your web content.
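As an illustration of what that menu looks like in use, a search request from the client might combine a boolean query with filters. The field names below are hypothetical, shown only to convey the shape of such a call, not the server's actual tool schema:

{
  "query": "privacy AND (cookie OR tracking)",
  "type": "html",
  "status": 200
}

In practice, the AI client chooses these parameters itself, based on your conversation.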

mcp-server-webcrawl is free and open source, and requires Claude Desktop and Python (>= 3.10). It is installed from the command line, via pip:

pip install mcp-server-webcrawl
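Once installed, the mcp-server-webcrawl command should be on your PATH; Claude Desktop will invoke it with the arguments you configure below. Assuming the entry point follows standard argparse conventions, --help should list the available options, including --crawler and --datasrc:

$ mcp-server-webcrawl --help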

Main Features

  • Claude Desktop ready
  • Full-text search support
  • Filter by type, status, and more
  • Supports wget, WARC, and more
  • Augment your LLM knowledge base
  • ChatGPT support coming soon

MCP Configuration

wget

{
  "mcpServers": {
    "webcrawl": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "wget", "--datasrc", "/path/to/wget/archives/"]
    }
  }
}

# tested configurations (macOS Terminal/Windows WSL)
# --adjust-extension for file extensions, e.g. *.html
$ wget --mirror https://example.com
$ wget --mirror https://example.com --adjust-extension

WARC

{
  "mcpServers": {
    "webcrawl": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "warc", "--datasrc", "/path/to/warc/archives/"]
    }
  }
}

# tested configurations (macOS Terminal/Windows WSL)
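# --page-requisites additionally saves page assets (images, CSS, JS)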
$ wget --warc-file=example --recursive https://example.com
$ wget --warc-file=example --recursive --page-requisites https://example.com

InterroBot

{
  "mcpServers": {
    "webcrawl": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "interrobot", "--datasrc",
        "[homedir]/Documents/InterroBot/interrobot.v2.db"]
    }
  }
}

# crawls executed in InterroBot (windowed)
# Windows: replace [homedir] with /Users/...
# macOS: path provided on InterroBot settings page

Katana

{
  "mcpServers": {
    "webcrawl": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "katana", "--datasrc", "/path/to/katana/crawls/"]
    }
  }
}

# tested configurations (macOS Terminal/PowerShell/WSL)
# -store-response to save crawl contents
# -store-response-dir allows for many site crawls in one dir
$ katana -u https://example.com -store-response -store-response-dir crawls/

SiteOne

{
  "mcpServers": {
    "webcrawl": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "siteone", "--datasrc", "/path/to/siteone/archives/"]
    }
  }
}

# crawls executed in SiteOne (windowed)
# *Generate offline website* must be checked

From Claude Desktop's developer settings, find the MCP configuration file (claude_desktop_config.json). Open it in a text editor and modify the example to reflect your datasrc path.
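On default installs, the configuration file lives in Claude Desktop's application data directory:

# macOS
~/Library/Application Support/Claude/claude_desktop_config.json
# Windows
%APPDATA%\Claude\claude_desktop_config.json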

You can set up multiple mcp-server-webcrawl connections under mcpServers if you want to search more than one crawl source, as shown below.
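For example, a single configuration can expose a wget archive and a Katana crawl side by side (the keys webcrawl-wget and webcrawl-katana are arbitrary labels of your choosing):

{
  "mcpServers": {
    "webcrawl-wget": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "wget", "--datasrc", "/path/to/wget/archives/"]
    },
    "webcrawl-katana": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "katana", "--datasrc", "/path/to/katana/crawls/"]
    }
  }
}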

[Figure: abstraction of LLM clients (Claude, OpenAI) communicating with a website archive]