📦

io.github.jztan/pdf-mcp

Production-ready MCP server for PDF processing with intelligent caching.

0 installs

Trust: 37 — Low

Content

Ask AI about io.github.jztan/pdf-mcp

I know everything about io.github.jztan/pdf-mcp. Ask me about installation, configuration, usage, or troubleshooting.

0/500

Loading tools...

Reviews

Documentation

pdf-mcp

A Model Context Protocol (MCP) server that enables AI agents to read, search, and extract content from PDF files. Built with Python and PyMuPDF, with SQLite-based caching for persistence across server restarts.

mcp-name: io.github.jztan/pdf-mcp

Try it in your browser

See what your AI agent sees →

Walk through the three main tools (pdf_info, pdf_search, pdf_read_pages) with any PDF. 100% client-side, no install required.

Features

Give your agent surgical access to PDFs instead of flooding context with raw text.

Hybrid search — find relevant pages with a question, not a page range. Combines BM25 keyword and semantic search via Reciprocal Rank Fusion
Paginated reading — fetch only the pages your agent needs; large documents don't blow your context window
OCR — scanned and image-based PDFs are fully readable and searchable via Tesseract
Structured extraction — tables, embedded images, and table of contents returned as structured data, not text soup
Persistent cache — SQLite-backed; re-reads are instant and survive server restarts
Secure URL fetching — HTTPS-only with SSRF protection; local network ranges are blocked

Installation

pip install pdf-mcp

For semantic search (adds fastembed and numpy, ~67 MB model download on first use):

pip install 'pdf-mcp[semantic]'

For OCR on scanned PDFs (requires system Tesseract):

# macOS
brew install tesseract

# Ubuntu/Debian
apt install tesseract-ocr

# Windows — download the installer from:
# https://github.com/UB-Mannheim/tesseract/wiki
# Then add the install directory to your PATH.

Quick Start

Choose your MCP client below to get started:

Claude Code

claude mcp add pdf-mcp -- pdf-mcp

Or add to ~/.claude.json:

{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}

Claude Desktop

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}

Config file location:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

Restart Claude Desktop after updating the config.

Visual Studio Code

Requires VS Code 1.101+ with GitHub Copilot.

CLI:

code --add-mcp '{"name":"pdf-mcp","command":"pdf-mcp"}'

Command Palette:

Open Command Palette (Cmd/Ctrl+Shift+P)
Run MCP: Open User Configuration (global) or MCP: Open Workspace Folder Configuration (project-specific)

Add the configuration:

{
  "servers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}

Save. VS Code will automatically load the server.

Manual: Create .vscode/mcp.json in your workspace:

{
  "servers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}

Codex CLI

codex mcp add pdf-mcp -- pdf-mcp

Or configure manually in ~/.codex/config.toml:

[mcp_servers.pdf-mcp]
command = "pdf-mcp"

Kiro

Create or edit .kiro/settings/mcp.json in your workspace:

{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp",
      "args": [],
      "disabled": false
    }
  }
}

Save and restart Kiro.

Other MCP Clients

Most MCP clients use a standard configuration format:

{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}

With uvx (for isolated environments):

{
  "mcpServers": {
    "pdf-mcp": {
      "command": "uvx",
      "args": ["pdf-mcp"]
    }
  }
}

Verify Installation

pdf-mcp --help

Tools

`pdf_info` — Get Document Information

Returns page count, metadata, file size, estimated token count, and text_coverage — a per-page list of {page, text_chars, raster_images} that lets agents identify OCR candidates without reading content. Call this first to understand a document. Includes toc_entry_count and inline TOC entries when the document has ≤50 bookmarks; larger TOCs return toc_truncated: true — use pdf_get_toc to retrieve the full outline.

"Read the PDF at /path/to/document.pdf"

`pdf_read_pages` — Read Specific Pages

Read selected pages to manage context size. Each page dict includes text, images/image_count, and tables/table_count. Tables are extracted as structured data (header + rows) and inlined directly in the page response — no separate tool call needed.

Optional parameters:

ocr=True / ocr_lang="eng" — run Tesseract OCR on pages with no extractable text; requires system Tesseract (brew install tesseract); capped at 20 pages per call
render_dpi=200 — attach a rendered PNG path alongside text for each page (shares cache with pdf_render_pages)

"Read pages 1-10 of the PDF"
"Read pages 15, 20, and 25-30"
"OCR pages 3-5 of the scanned PDF"

`pdf_read_all` — Read Entire Document

Read a complete document in one call. Best for short documents (~50 pages or fewer) where you want everything at once. Does not include images or tables — use pdf_read_pages for those.

Optional parameters:

max_pages=50 — safety cap on pages read (default 50, max 500)

"Read the entire PDF (it's only 10 pages)"

`pdf_render_pages` — Render Pages as Images

Render PDF pages as PNG images for vision-capable models. Use when you need to see page content — diagrams, handwriting, scanned pages, or any page where text extraction is insufficient. Returns MCP image content blocks that vision models can process natively. Up to 5 pages per call; DPI clamped to 72–400.

For extracting text from scanned pages, use pdf_read_pages(ocr=True) instead — the two tools are orthogonal.

"Show me what page 5 looks like"
"Render the diagram on page 12"

`pdf_search` — Search Within PDF

Find relevant pages before loading content. The default mode is hybrid — Reciprocal Rank Fusion (RRF) merges BM25 keyword results and semantic embedding results into a single ranked list. This consistently outperforms either method alone: keyword search finds exact terms that embeddings miss; semantic search finds conceptual matches that keyword search misses; RRF fusion captures both.

Three modes are available:

mode="auto" (default) — Hybrid RRF when pdf-mcp[semantic] is installed; keyword-only fallback otherwise.
mode="keyword" — BM25/FTS5 only. Best for exact identifiers, product codes, precise terms.
mode="semantic" — Semantic only (requires pdf-mcp[semantic]). Best for conceptual queries.

Response includes search_mode: "hybrid" | "keyword" | "semantic" indicating which path ran.

The first call on a new document embeds all pages (one-time cost, typically a few seconds for large documents); subsequent calls are instant.

"Search for 'quarterly revenue' in the PDF"
"Find pages about revenue growth in the PDF"
"Which pages discuss supply chain risks?"

`pdf_get_toc` — Get Table of Contents

Returns the full outline with titles, levels, and page numbers. Use when pdf_info returns toc_truncated: true (documents with more than 50 bookmarks).

"Show me the table of contents"

`pdf_cache_stats` — View Cache Statistics

Returns a breakdown of what's cached per document — page text, images, tables, embeddings, and rendered PNGs — plus total cache size and hit counts.

"Show PDF cache statistics"

`pdf_cache_clear` — Clear Cache

Removes expired or all cache entries. Use when cached content is stale or to free disk space.

"Clear expired PDF cache entries"

Example Workflow

For a large document (e.g., a 200-page annual report):

User: "Summarize the risk factors in this annual report"

Agent workflow:
1. pdf_info("report.pdf")
   → 200 pages, TOC shows "Risk Factors" on page 89

2. pdf_search("report.pdf", "risk factors")
   → Relevant pages: 89-110

3. pdf_read_pages("report.pdf", "89-100")
   → First batch

4. pdf_read_pages("report.pdf", "101-110")
   → Second batch

5. Synthesize answer from chunks

Caching

The server uses SQLite for persistent caching. This is necessary because MCP servers using STDIO transport are spawned as a new process for each conversation.

Cache location: ~/.cache/pdf-mcp/cache.db

What's cached:

Data	Benefit
Metadata + text coverage	Avoid re-parsing document info
Page text	Skip re-extraction
Images	Skip re-encoding
Tables	Skip re-detection
TOC	Skip re-parsing
FTS5 index	O(log N) search with BM25 ranking after first query
Embeddings	Instant semantic search after first indexing run
Rendered PNGs	Skip re-rendering; shared between `pdf_render_pages` and `pdf_read_pages(render_dpi=…)`

Cache invalidation:

Automatic when file modification time changes
Manual via the pdf_cache_clear tool
TTL: 24 hours (configurable)

Configuration

Access control (optional)

Create ~/.config/pdf-mcp/config.toml to restrict which local paths and URL hosts the server will access. The file is optional — if absent, the server is permissive within the built-in SSRF floor (HTTPS-only, blocked private IP ranges).

[paths]
allow = ["~/Documents/**", "/data/pdfs/**"]
deny  = ["~/.ssh/**", "~/.aws/**"]

[urls]
allow = ["*.internal.example.com"]
deny  = ["untrusted.example.com"]

Rules use shell-glob patterns (* matches across path separators). deny wins when both match. Path matching operates on the resolved path after symlink expansion. A malformed config file prevents the server from starting — it never silently falls back to permissive.

Environment variables

# Cache directory (default: ~/.cache/pdf-mcp)
PDF_MCP_CACHE_DIR=/path/to/cache

# Cache TTL in hours (default: 24)
PDF_MCP_CACHE_TTL=48

Development

git clone https://github.com/jztan/pdf-mcp.git
cd pdf-mcp

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Type checking
mypy src/

# Linting
flake8 src/ tests/

# Formatting
black src/ tests/

Why pdf-mcp?

	Without pdf-mcp	With pdf-mcp
Large PDFs	Context overflow	Chunked reading
Token budgeting	Guess and overflow	Estimated tokens before reading
Finding content	Load everything	Hybrid search — RRF fusion of BM25 keyword (FTS5) + semantic embeddings; never misses what either alone would
Tables	Lost in raw text	Extracted and inlined per page
Images	Ignored	Extracted as PNG files
Repeated access	Re-parse every time	SQLite cache
Scanned PDFs	No text extracted	OCR via Tesseract (`pdf_read_pages(ocr=True)`)
Visual content	Must describe in words	Render page as image (`pdf_render_pages`)
Tool design	Single monolithic tool	8 specialized tools

Roadmap

See ROADMAP.md for planned features and release history.

Contributing

Contributions are welcome. Please submit a pull request.

License

MIT — see LICENSE.

io.github.jztan/pdf-mcp

Reviews

Documentation

pdf-mcp

Try it in your browser

Features

Installation

Quick Start

Verify Installation

Tools

`pdf_info` — Get Document Information

`pdf_read_pages` — Read Specific Pages

`pdf_read_all` — Read Entire Document

`pdf_render_pages` — Render Pages as Images

`pdf_search` — Search Within PDF

`pdf_get_toc` — Get Table of Contents

`pdf_cache_stats` — View Cache Statistics

`pdf_cache_clear` — Clear Cache

Example Workflow

Caching

Configuration

Access control (optional)

Environment variables

Development

Why pdf-mcp?

Roadmap

Contributing

License

Links

Security Checklist

io.github.jztan/pdf-mcp

Reviews

Documentation

pdf-mcp

Try it in your browser

Features

Installation

Quick Start

Verify Installation

Tools

pdf_info — Get Document Information

pdf_read_pages — Read Specific Pages

pdf_read_all — Read Entire Document

pdf_render_pages — Render Pages as Images

pdf_search — Search Within PDF

pdf_get_toc — Get Table of Contents

pdf_cache_stats — View Cache Statistics

pdf_cache_clear — Clear Cache

Example Workflow

Caching

Configuration

Access control (optional)

Environment variables

Development

Why pdf-mcp?

Roadmap

Contributing

License

Links

Security Checklist

`pdf_info` — Get Document Information

`pdf_read_pages` — Read Specific Pages

`pdf_read_all` — Read Entire Document

`pdf_render_pages` — Render Pages as Images

`pdf_search` — Search Within PDF

`pdf_get_toc` — Get Table of Contents

`pdf_cache_stats` — View Cache Statistics

`pdf_cache_clear` — Clear Cache