io.github.jztan/pdf-mcp
Production-ready MCP server for PDF processing with intelligent caching.
pdf-mcp
A Model Context Protocol (MCP) server that enables AI agents to read, search, and extract content from PDF files. Built with Python and PyMuPDF, with SQLite-based caching for persistence across server restarts.
mcp-name: io.github.jztan/pdf-mcp
Try it in your browser
See what your AI agent sees →
Walk through the three main tools (pdf_info, pdf_search, pdf_read_pages) with any PDF. 100% client-side, no install required.
Features
Give your agent surgical access to PDFs instead of flooding context with raw text.
- Hybrid search: find relevant pages with a question, not a page range. Combines BM25 keyword and semantic search via Reciprocal Rank Fusion
- Paginated reading: fetch only the pages your agent needs; large documents don't blow your context window
- OCR: scanned and image-based PDFs are fully readable and searchable via Tesseract
- Structured extraction: tables, embedded images, and table of contents returned as structured data, not text soup
- Persistent cache: SQLite-backed; re-reads are instant and survive server restarts
- Secure URL fetching: HTTPS-only with SSRF protection; local network ranges are blocked
Installation
pip install pdf-mcp
For semantic search (adds fastembed and numpy, ~67 MB model download on first use):
pip install 'pdf-mcp[semantic]'
For OCR on scanned PDFs (requires system Tesseract):
# macOS
brew install tesseract
# Ubuntu/Debian
apt install tesseract-ocr
# Windows: download the installer from
# https://github.com/UB-Mannheim/tesseract/wiki
# Then add the install directory to your PATH.
Quick Start
Choose your MCP client below to get started:
Claude Code
claude mcp add pdf-mcp -- pdf-mcp
Or add to ~/.claude.json:
{
"mcpServers": {
"pdf-mcp": {
"command": "pdf-mcp"
}
}
}
Claude Desktop
Add to your claude_desktop_config.json:
{
"mcpServers": {
"pdf-mcp": {
"command": "pdf-mcp"
}
}
}
Config file location:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
Restart Claude Desktop after updating the config.
Visual Studio Code
Requires VS Code 1.101+ with GitHub Copilot.
CLI:
code --add-mcp '{"name":"pdf-mcp","command":"pdf-mcp"}'
Command Palette:
1. Open the Command Palette (Cmd/Ctrl+Shift+P).
2. Run MCP: Open User Configuration (global) or MCP: Open Workspace Folder Configuration (project-specific).
3. Add the configuration:
{ "servers": { "pdf-mcp": { "command": "pdf-mcp" } } }
4. Save. VS Code will automatically load the server.
Manual: Create .vscode/mcp.json in your workspace:
{
"servers": {
"pdf-mcp": {
"command": "pdf-mcp"
}
}
}
Codex CLI
codex mcp add pdf-mcp -- pdf-mcp
Or configure manually in ~/.codex/config.toml:
[mcp_servers.pdf-mcp]
command = "pdf-mcp"
Kiro
Create or edit .kiro/settings/mcp.json in your workspace:
{
"mcpServers": {
"pdf-mcp": {
"command": "pdf-mcp",
"args": [],
"disabled": false
}
}
}
Save and restart Kiro.
Other MCP Clients
Most MCP clients use a standard configuration format:
{
"mcpServers": {
"pdf-mcp": {
"command": "pdf-mcp"
}
}
}
With uvx (for isolated environments):
{
"mcpServers": {
"pdf-mcp": {
"command": "uvx",
"args": ["pdf-mcp"]
}
}
}
Verify Installation
pdf-mcp --help
Tools
pdf_info: Get Document Information
Returns page count, metadata, file size, estimated token count, and text_coverage, a per-page list of {page, text_chars, raster_images} that lets agents identify OCR candidates without reading content. Call this first to understand a document. Includes toc_entry_count and inline TOC entries when the document has 50 or fewer bookmarks; larger TOCs return toc_truncated: true, in which case use pdf_get_toc to retrieve the full outline.
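As a minimal sketch of how an agent might use text_coverage, the helper below flags pages that contain raster images but little or no extractable text. The sample payload, the ocr_candidates name, and the min_chars threshold are illustrative assumptions; only the page, text_chars, and raster_images keys come from the description above.

```python
from typing import Dict, List


def ocr_candidates(text_coverage: List[Dict[str, int]], min_chars: int = 1) -> List[int]:
    """Return page numbers that look scanned: images present, almost no text.

    `min_chars` is a hypothetical threshold, not a pdf-mcp setting.
    """
    return [
        entry["page"]
        for entry in text_coverage
        if entry["text_chars"] < min_chars and entry["raster_images"] > 0
    ]


# Hand-written sample in the documented {page, text_chars, raster_images} shape.
sample_coverage = [
    {"page": 1, "text_chars": 1840, "raster_images": 0},
    {"page": 2, "text_chars": 0, "raster_images": 1},
    {"page": 3, "text_chars": 0, "raster_images": 0},  # blank page, nothing to OCR
    {"page": 4, "text_chars": 0, "raster_images": 2},
]

print(ocr_candidates(sample_coverage))  # [2, 4]
```

Pages returned this way are natural inputs for pdf_read_pages with ocr=True.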
"Read the PDF at /path/to/document.pdf"
pdf_read_pages: Read Specific Pages
Read selected pages to manage context size. Each page dict includes text, images/image_count, and tables/table_count. Tables are extracted as structured data (header + rows) and inlined directly in the page response, so no separate tool call is needed.
Optional parameters:
- ocr=True / ocr_lang="eng": run Tesseract OCR on pages with no extractable text; requires system Tesseract (brew install tesseract); capped at 20 pages per call
- render_dpi=200: attach a rendered PNG path alongside text for each page (shares cache with pdf_render_pages)
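A page selection such as "15, 20, 25-30" can be treated as a compact page specification. The sketch below expands such a string into page numbers and splits the result into batches that respect the 20-page OCR cap. The parse_page_spec and batches helpers and the string format are assumptions for illustration; the server's actual argument format may differ.

```python
from typing import Iterator, List

OCR_PAGE_CAP = 20  # per-call OCR limit stated above


def parse_page_spec(spec: str) -> List[int]:
    """Expand '1-3,7' into [1, 2, 3, 7] (1-based, de-duplicated, sorted)."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


def batches(pages: List[int], size: int = OCR_PAGE_CAP) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` pages."""
    for i in range(0, len(pages), size):
        yield pages[i:i + size]


print(parse_page_spec("15, 20, 25-30"))
print([len(chunk) for chunk in batches(parse_page_spec("1-45"))])
```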
"Read pages 1-10 of the PDF"
"Read pages 15, 20, and 25-30"
"OCR pages 3-5 of the scanned PDF"
pdf_read_all: Read Entire Document
Read a complete document in one call. Best for short documents (~50 pages or fewer) where you want everything at once. Does not include images or tables; use pdf_read_pages for those.
Optional parameters:
- max_pages=50: safety cap on pages read (default 50, max 500)
"Read the entire PDF (it's only 10 pages)"
pdf_render_pages: Render Pages as Images
Render PDF pages as PNG images for vision-capable models. Use when you need to see page content: diagrams, handwriting, scanned pages, or any page where text extraction is insufficient. Returns MCP image content blocks that vision models can process natively. Up to 5 pages per call; DPI clamped to 72 to 400.
For extracting text from scanned pages, use pdf_read_pages(ocr=True) instead; the two tools are orthogonal.
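To get a feel for what a DPI setting costs, the sketch below clamps a requested DPI to the documented range and estimates the pixel size of the rendered page, using the fact that PDF page dimensions are measured in points (1/72 inch). clamp_dpi and rendered_size are hypothetical client-side helpers; the server applies its own clamping regardless.

```python
from typing import Tuple

MIN_DPI, MAX_DPI = 72, 400  # clamp range stated above


def clamp_dpi(dpi: int) -> int:
    """Force a requested DPI into the supported range."""
    return max(MIN_DPI, min(MAX_DPI, dpi))


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Estimate rendered pixel dimensions for a page measured in points."""
    scale = clamp_dpi(dpi) / 72  # 72 points per inch
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points.
print(rendered_size(612, 792, 200))
print(rendered_size(612, 792, 1200))  # request above the cap is clamped to 400
```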
"Show me what page 5 looks like"
"Render the diagram on page 12"
pdf_search: Search Within PDF
Find relevant pages before loading content. The default mode is hybrid, in which Reciprocal Rank Fusion (RRF) merges BM25 keyword results and semantic embedding results into a single ranked list. This consistently outperforms either method alone: keyword search finds exact terms that embeddings miss, semantic search finds conceptual matches that keyword search misses, and RRF captures both.
Three modes are available:
- mode="auto" (default): hybrid RRF when pdf-mcp[semantic] is installed; keyword-only fallback otherwise.
- mode="keyword": BM25/FTS5 only. Best for exact identifiers, product codes, precise terms.
- mode="semantic": semantic only (requires pdf-mcp[semantic]). Best for conceptual queries.
The response includes search_mode: "hybrid" | "keyword" | "semantic", indicating which path ran.
The first call on a new document embeds all pages (a one-time cost, typically a few seconds for large documents); subsequent calls reuse the cached embeddings.
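To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked page lists. It is not pdf-mcp's implementation: the function name, the sample rankings, and the constant k = 60 (a value commonly used with RRF) are assumptions.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[int]], k: int = 60) -> List[int]:
    """Merge ranked lists of page numbers into one list, best first.

    Each page scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by page number.
    """
    scores: Dict[int, float] = {}
    for ranking in rankings:
        for rank, page in enumerate(ranking, start=1):
            scores[page] = scores.get(page, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda page: (-scores[page], page))


keyword_hits = [12, 3, 7]    # hypothetical BM25 ranking
semantic_hits = [3, 9, 12]   # hypothetical embedding ranking

print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))  # [3, 12, 9, 7]
```

Page 3 ranks first because it sits near the top of both lists, while pages found by only one method still appear further down.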
"Search for 'quarterly revenue' in the PDF"
"Find pages about revenue growth in the PDF"
"Which pages discuss supply chain risks?"
pdf_get_toc: Get Table of Contents
Returns the full outline with titles, levels, and page numbers. Use when pdf_info returns toc_truncated: true (documents with more than 50 bookmarks).
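Once the outline is available, an agent can turn a section title into a page range to read. The sketch below assumes each TOC entry is a dict with title, level, and page keys (the description names these fields but not the exact response shape), and section_range is a hypothetical helper.

```python
from typing import Dict, List, Optional, Tuple, Union

TocEntry = Dict[str, Union[str, int]]


def section_range(toc: List[TocEntry], title: str, last_page: int) -> Optional[Tuple[int, int]]:
    """Return (first_page, last_page) for the first entry whose title contains `title`.

    The section ends just before the next entry at the same or a shallower level,
    or at `last_page` if no such entry follows.
    """
    needle = title.lower()
    for i, entry in enumerate(toc):
        if needle in str(entry["title"]).lower():
            start = int(entry["page"])
            end = last_page
            for later in toc[i + 1:]:
                if int(later["level"]) <= int(entry["level"]):
                    end = max(start, int(later["page"]) - 1)
                    break
            return start, end
    return None


# Hand-written outline for illustration only.
sample_toc: List[TocEntry] = [
    {"title": "Business", "level": 1, "page": 5},
    {"title": "Risk Factors", "level": 1, "page": 89},
    {"title": "Market Risks", "level": 2, "page": 95},
    {"title": "Legal Proceedings", "level": 1, "page": 111},
]

print(section_range(sample_toc, "risk factors", last_page=200))
```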
"Show me the table of contents"
pdf_cache_stats: View Cache Statistics
Returns a breakdown of what's cached per document (page text, images, tables, embeddings, and rendered PNGs) plus total cache size and hit counts.
"Show PDF cache statistics"
pdf_cache_clear: Clear Cache
Removes expired or all cache entries. Use when cached content is stale or to free disk space.
"Clear expired PDF cache entries"
Example Workflow
For a large document (e.g., a 200-page annual report):
User: "Summarize the risk factors in this annual report"
Agent workflow:
1. pdf_info("report.pdf")
→ 200 pages, TOC shows "Risk Factors" on page 89
2. pdf_search("report.pdf", "risk factors")
→ Relevant pages: 89-110
3. pdf_read_pages("report.pdf", "89-100")
→ First batch
4. pdf_read_pages("report.pdf", "101-110")
→ Second batch
5. Synthesize answer from chunks
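The batching in steps 3 and 4 can be sketched as a small helper that splits a page range into read-sized requests. batch_ranges and the batch size of 12 are illustrative assumptions chosen to reproduce the two batches in the workflow.

```python
from typing import List


def batch_ranges(first: int, last: int, size: int) -> List[str]:
    """Split the inclusive range first..last into 'start-end' strings of at most `size` pages."""
    if size < 1:
        raise ValueError("size must be at least 1")
    ranges = []
    start = first
    while start <= last:
        end = min(start + size - 1, last)
        ranges.append(f"{start}-{end}" if end > start else str(start))
        start = end + 1
    return ranges


print(batch_ranges(89, 110, size=12))  # ['89-100', '101-110']
```

Smaller batches keep each response comfortably inside the context window at the cost of more tool calls.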
Caching
The server uses SQLite for persistent caching. This is necessary because MCP servers using STDIO transport are spawned as a new process for each conversation.
Cache location: ~/.cache/pdf-mcp/cache.db
What's cached:
| Data | Benefit |
|---|---|
| Metadata + text coverage | Avoid re-parsing document info |
| Page text | Skip re-extraction |
| Images | Skip re-encoding |
| Tables | Skip re-detection |
| TOC | Skip re-parsing |
| FTS5 index | O(log N) search with BM25 ranking after first query |
| Embeddings | Instant semantic search after first indexing run |
| Rendered PNGs | Skip re-rendering; shared between pdf_render_pages and pdf_read_pages(render_dpi=…) |
Cache invalidation:
- Automatic when file modification time changes
- Manual via the pdf_cache_clear tool
- TTL: 24 hours (configurable)
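These invalidation rules can be illustrated with a small SQLite-backed cache that treats an entry as stale when the file's modification time changes or the entry is older than the TTL. This is a sketch under assumptions: the table layout, the function names, and the in-memory database are invented for illustration and do not reflect pdf-mcp's actual schema.

```python
import sqlite3
import time
from typing import Optional

TTL_SECONDS = 24 * 3600  # default TTL stated above


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a cache database with one hypothetical page-text table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_text ("
        " file_path TEXT, page INTEGER, mtime REAL, cached_at REAL, text TEXT,"
        " PRIMARY KEY (file_path, page))"
    )
    return conn


def put(conn: sqlite3.Connection, file_path: str, page: int, mtime: float,
        text: str, now: Optional[float] = None) -> None:
    """Store page text together with the file mtime seen at extraction time."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO page_text VALUES (?, ?, ?, ?, ?)",
        (file_path, page, mtime, now, text),
    )
    conn.commit()


def get(conn: sqlite3.Connection, file_path: str, page: int, mtime: float,
        now: Optional[float] = None) -> Optional[str]:
    """Return cached text, or None if missing, expired, or the file has changed."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT mtime, cached_at, text FROM page_text WHERE file_path = ? AND page = ?",
        (file_path, page),
    ).fetchone()
    if row is None:
        return None
    cached_mtime, cached_at, text = row
    if cached_mtime != mtime or now - cached_at > TTL_SECONDS:
        return None  # stale: file modified or TTL exceeded
    return text


conn = open_cache()
put(conn, "report.pdf", 1, mtime=100.0, text="page one", now=0.0)
print(get(conn, "report.pdf", 1, mtime=100.0, now=60.0))  # fresh entry
print(get(conn, "report.pdf", 1, mtime=200.0, now=60.0))  # file changed on disk
```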
Configuration
Access control (optional)
Create ~/.config/pdf-mcp/config.toml to restrict which local paths and URL hosts the server will access. The file is optional; if absent, the server is permissive within the built-in SSRF floor (HTTPS-only, blocked private IP ranges).
[paths]
allow = ["~/Documents/**", "/data/pdfs/**"]
deny = ["~/.ssh/**", "~/.aws/**"]
[urls]
allow = ["*.internal.example.com"]
deny = ["untrusted.example.com"]
Rules use shell-glob patterns (* matches across path separators). deny wins when both match. Path matching operates on the resolved path after symlink expansion. A malformed config file prevents the server from starting; it never silently falls back to permissive.
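The matching rules can be sketched with Python's fnmatch module, whose * also matches across path separators. The is_path_allowed helper is hypothetical, and treating an empty allow list as "allow anything not denied" is an assumption. The real server also resolves symlinks before matching, which this sketch leaves to the caller.

```python
import fnmatch
import os
from typing import Sequence


def is_path_allowed(path: str, allow: Sequence[str], deny: Sequence[str]) -> bool:
    """Apply deny-wins glob rules to an already-resolved absolute path."""

    def matches(patterns: Sequence[str]) -> bool:
        # expanduser lets patterns such as "~/Documents/**" refer to the home directory.
        return any(
            fnmatch.fnmatchcase(path, os.path.expanduser(pattern))
            for pattern in patterns
        )

    if matches(deny):
        return False  # deny wins when both lists match
    if not allow:
        return True   # assumption: no allow list means no extra restriction
    return matches(allow)


allow = ["/data/pdfs/**"]
deny = ["/data/pdfs/private/**"]
print(is_path_allowed("/data/pdfs/2024/q1/report.pdf", allow, deny))
print(is_path_allowed("/data/pdfs/private/salaries.pdf", allow, deny))
```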
Environment variables
# Cache directory (default: ~/.cache/pdf-mcp)
PDF_MCP_CACHE_DIR=/path/to/cache
# Cache TTL in hours (default: 24)
PDF_MCP_CACHE_TTL=48
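For reference, this is roughly how a process could resolve these two settings with their documented defaults. load_cache_settings is a hypothetical helper, not part of pdf-mcp; it only illustrates the variable names and defaults listed above.

```python
import os
from pathlib import Path
from typing import Mapping, Tuple


def load_cache_settings(env: Mapping[str, str] = os.environ) -> Tuple[Path, float]:
    """Return (cache_dir, ttl_hours), falling back to the documented defaults."""
    cache_dir = Path(env.get("PDF_MCP_CACHE_DIR", "~/.cache/pdf-mcp")).expanduser()
    ttl_hours = float(env.get("PDF_MCP_CACHE_TTL", "24"))
    return cache_dir, ttl_hours


# Passing a plain dict instead of os.environ keeps the example deterministic.
print(load_cache_settings({"PDF_MCP_CACHE_DIR": "/path/to/cache", "PDF_MCP_CACHE_TTL": "48"}))
```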
Development
git clone https://github.com/jztan/pdf-mcp.git
cd pdf-mcp
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Type checking
mypy src/
# Linting
flake8 src/ tests/
# Formatting
black src/ tests/
Why pdf-mcp?
| | Without pdf-mcp | With pdf-mcp |
|---|---|---|
| Large PDFs | Context overflow | Chunked reading |
| Token budgeting | Guess and overflow | Estimated tokens before reading |
| Finding content | Load everything | Hybrid search: RRF fusion of BM25 keyword (FTS5) + semantic embeddings; catches matches that either method alone would miss |
| Tables | Lost in raw text | Extracted and inlined per page |
| Images | Ignored | Extracted as PNG files |
| Repeated access | Re-parse every time | SQLite cache |
| Scanned PDFs | No text extracted | OCR via Tesseract (pdf_read_pages(ocr=True)) |
| Visual content | Must describe in words | Render page as image (pdf_render_pages) |
| Tool design | Single monolithic tool | 8 specialized tools |
Roadmap
See ROADMAP.md for planned features and release history.
Contributing
Contributions are welcome. Please submit a pull request.
License
MIT (see LICENSE).
Links
- pdf-mcp on PyPI
- pdf-mcp on GitHub
- How I Built pdf-mcp - The problem with large PDFs in AI agents and a working solution
- MCP Server Security: 8 Vulnerabilities - What we found when we audited an MCP server for security holes
- How Claude Code Actually Reads PDFs - How AI agents use pdf-mcp tools to read and navigate PDF documents
- Semantic vs Keyword Search for AI Agents - Benchmarks and a dual-search routing pattern: FTS5 for exact identifiers, embeddings for natural language
- Hybrid Search vs Query Routing for AI Agents - Why pdf-mcp uses hybrid RRF instead of query routing: benchmarks showing RRF wins across query types
