AI Search and Retrieval for Web Crawlers

Bridge the gap between your web crawl and AI language models using the Model Context Protocol (MCP). With mcp-server-webcrawl, your AI client filters and analyzes crawled content, under your direction or autonomously, extracting insights from the sites you have archived.

Support for WARC, wget, InterroBot, Katana, and SiteOne crawlers is available out of the gate. The server includes a full-text search interface with boolean support, as well as resource filtering by type, HTTP status, and more. Together, these give the LLM a complete menu of ways to search your web content.

mcp-server-webcrawl is free and open source, and requires Claude Desktop and Python (>=3.10). It is installed on the command line via pip:

pip install mcp-server-webcrawl
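
After installation, you can confirm the executable is on your PATH; on macOS this also yields the absolute path used in the MCP configuration below (the output shown is illustrative, and your path will differ):

$ which mcp-server-webcrawl
/usr/local/bin/mcp-server-webcrawl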

Main Features

  • Claude Desktop ready
  • Multi-crawler compatible
  • Filter by type, status, and more
  • Boolean search support (see the example queries after this list)
  • Support for Markdown and snippets
  • Roll your own website knowledgebase
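
As a taste of the boolean search interface, queries along these lines are typical; the examples are illustrative, so verify the exact syntax supported in the mcp-server-webcrawl documentation:

# illustrative boolean queries (exact syntax may vary)
privacy policy        # match either term
"privacy policy"      # match the exact phrase
privacy AND policy    # require both terms
privacy NOT cookie    # exclude a term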

Getting Started

Setup videos are available for each supported crawler, showing how to connect your crawl data to your LLM.

If you prefer text to video, step-by-step guides are available in the mcp-server-webcrawl documentation.

MCP Configuration

wget

# Windows: command set to "mcp-server-webcrawl"
# macOS: command set to absolute path, i.e.
# the value of $ which mcp-server-webcrawl
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "wget", "--datasrc",
        "/path/to/wget/archives/"]
    }
  }
}

# tested configurations (macOS Terminal/Windows WSL)
# from /path/to/wget/archives/ as current working directory
# --adjust-extension adds file extensions, e.g. *.html
$ wget --mirror https://example.com
$ wget --mirror https://example.com --adjust-extension

WARC

# Windows: command set to "mcp-server-webcrawl"
# macOS: command set to absolute path, i.e.
# the value of $ which mcp-server-webcrawl
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "warc", "--datasrc",
        "/path/to/warc/archives/"]
    }
  }
}

# tested configurations (macOS Terminal/Windows WSL)
# from /path/to/warc/archives/ as current working directory
$ wget --warc-file=example --recursive https://example.com
$ wget --warc-file=example --recursive --page-requisites https://example.com

InterroBot

# Windows: command set to "mcp-server-webcrawl"
# macOS: command set to absolute path, i.e.
# the value of $ which mcp-server-webcrawl
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "interrobot", "--datasrc",
        "[homedir]/Documents/InterroBot/interrobot.v2.db"]
    }
  }
}

# crawls executed in InterroBot (windowed)
# Windows: replace [homedir] with C:/Users/...
# macOS: path provided on InterroBot settings page

Katana

# Windows: command set to "mcp-server-webcrawl"
# macOS: command set to absolute path, i.e.
# the value of $ which mcp-server-webcrawl
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "katana", "--datasrc",
        "/path/to/katana/crawls/"]
    }
  }
}

# tested configurations (macOS Terminal/PowerShell/WSL)
# -store-response saves crawl contents
# -store-response-dir allows for expansion across hosts,
#   consistent with default Katana behavior of spreading
#   assets across host directories
$ katana -u https://example.com -store-response -store-response-dir /path/to/katana/crawls/example.com/

SiteOne

# Windows: command set to "mcp-server-webcrawl"
# macOS: command set to absolute path, i.e.
# the value of $ which mcp-server-webcrawl
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "siteone", "--datasrc",
        "/path/to/siteone/archives/"]
    }
  }
}

# crawls executed in SiteOne (windowed)
# *Generate offline website* must be checked

From Claude's developer settings, locate the MCP configuration file. Open it in a text editor and adapt the example for your crawler so that it reflects your datasrc path.

You can set up additional mcp-server-webcrawl connections under mcpServers if you want, one per crawl archive, as shown below.
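
For example, one configuration can expose two archives side by side; the server names webcrawl-wget and webcrawl-katana below are arbitrary labels, and each entry follows the same pattern as the single-server examples above:

{
  "mcpServers": {
    "webcrawl-wget": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "wget", "--datasrc",
        "/path/to/wget/archives/"]
    },
    "webcrawl-katana": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "katana", "--datasrc",
        "/path/to/katana/crawls/"]
    }
  }
}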

For additional technical information, including crawler feature support, be sure to check the help documentation.

[Figure: abstraction of LLM clients (Claude and OpenAI) communicating with a website archive]