AI Search and Retrieval for Web Crawlers

Bridge the gap between your web crawl and AI language models using the Model Context Protocol (MCP). With mcp-server-webcrawl, your AI client filters and analyzes web content, under your direction or autonomously, extracting insights from your crawled sites.

Support for WARC, wget, InterroBot, Katana, and SiteOne crawlers is available out of the gate. The server includes a full-text search interface with boolean support, plus resource filtering by type, HTTP status, and more. mcp-server-webcrawl provides the LLM with a complete menu for searching your web content.
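As an illustration of what that menu looks like in use, a search request from the client might combine a boolean query with filters. The field names below are hypothetical, shown only to convey the shape of such a call, not the server's actual tool schema:

{
  "query": "privacy AND (cookie OR tracking)",
  "type": "html",
  "status": 200
}

In practice, the AI client chooses these parameters itself, based on your conversation.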

mcp-server-webcrawl is free and open source, and requires Claude Desktop and Python (>= 3.10). It is installed from the command line, via pip:

pip install mcp-server-webcrawl
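Once installed, the mcp-server-webcrawl command should be on your PATH; Claude Desktop will invoke it with the arguments you configure below. Assuming the entry point follows standard argparse conventions, --help should list the available options, including --crawler and --datasrc:

$ mcp-server-webcrawl --help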

Main Features

  • Claude Desktop ready
  • Full-text search support
  • Filter by type, status, and more
  • Supports wget, WARC, and more
  • Augment your LLM knowledge base
  • ChatGPT support coming soon

MCP Configuration

wget

{
  "mcpServers": {
    "webcrawl": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "wget", "--datasrc", "/path/to/wget/archives/"]
    }
  }
}

# tested configurations (macOS Terminal/Windows WSL)
# --adjust-extension for file extensions, e.g. *.html
$ wget --mirror https://example.com
$ wget --mirror https://example.com --adjust-extension

WARC

{
  "mcpServers": {
    "webcrawl": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "warc", "--datasrc", "/path/to/warc/archives/"]
    }
  }
}

# tested configurations (macOS Terminal/Windows WSL)
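# --page-requisites additionally saves page assets (images, CSS, JS)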
$ wget --warc-file=example --recursive https://example.com
$ wget --warc-file=example --recursive --page-requisites https://example.com

InterroBot

{
  "mcpServers": {
    "webcrawl": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "interrobot", "--datasrc",
        "[homedir]/Documents/InterroBot/interrobot.v2.db"]
    }
  }
}

# crawls executed in InterroBot (windowed)
# Windows: replace [homedir] with /Users/...
# macOS: path provided on InterroBot settings page

Katana

{
  "mcpServers": {
    "webcrawl": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "katana", "--datasrc", "/path/to/katana/crawls/"]
    }
  }
}

# tested configurations (macOS Terminal/PowerShell/WSL)
# -store-response to save crawl contents
# -store-response-dir allows for many site crawls in one dir
$ katana -u https://example.com -store-response -store-response-dir crawls/

SiteOne

{
  "mcpServers": {
    "webcrawl": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "siteone", "--datasrc", "/path/to/siteone/archives/"]
    }
  }
}

# crawls executed in SiteOne (windowed)
# *Generate offline website* must be checked

From Claude Desktop's developer settings, find the MCP configuration file (claude_desktop_config.json). Open it in a text editor and modify the example to reflect your datasrc path.
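On default installs, the configuration file lives in Claude Desktop's application data directory:

# macOS
~/Library/Application Support/Claude/claude_desktop_config.json
# Windows
%APPDATA%\Claude\claude_desktop_config.json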

You can set up multiple mcp-server-webcrawl connections under mcpServers if you want to search more than one crawl source, as shown below.
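For example, a single configuration can expose a wget archive and a Katana crawl side by side (the keys webcrawl-wget and webcrawl-katana are arbitrary labels of your choosing):

{
  "mcpServers": {
    "webcrawl-wget": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "wget", "--datasrc", "/path/to/wget/archives/"]
    },
    "webcrawl-katana": {
      "command": "mcp-server-webcrawl",
      "args": ["--crawler", "katana", "--datasrc", "/path/to/katana/crawls/"]
    }
  }
}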

[Figure: abstraction of LLM clients (Claude, OpenAI) communicating with a website archive]