pdf-server
MCP server for PDF processing - text extraction, search, and outline extraction
Installation
npx @paradyno/pdf-mcp-server
PDF MCP Server
A high-performance MCP server for PDF processing, built in Rust.
Give your AI agents powerful PDF capabilities: extract text, search, split, merge, encrypt, and more. All dependencies are Apache 2.0 licensed, keeping your project clean and permissive.
Features
| Category | Tools |
|---|---|
| Reading | extract_text · extract_metadata · extract_outline · extract_annotations · extract_links · extract_form_fields |
| Search & Discovery | search · list_pdfs · get_page_info · summarize_structure |
| Media | Image extraction (via extract_text) · convert_page_to_image |
| Manipulation | split_pdf · merge_pdfs · compress_pdf · fill_form |
| Security | protect_pdf · unprotect_pdf · Password-protected PDF support · Path sandboxing · SSRF protection |
| Resources | Expose PDFs as MCP Resources for direct client access |
| Performance | Batch processing · LRU caching (with byte budget) · Operation chaining via cache keys |
Installation
npm (Recommended)
npm install -g @paradyno/pdf-mcp-server
Pre-built Binaries
Download from GitHub Releases:
| Platform | x86_64 | ARM64 |
|---|---|---|
| Linux | pdf-mcp-server-linux-x64 | pdf-mcp-server-linux-arm64 |
| macOS | pdf-mcp-server-darwin-x64 | pdf-mcp-server-darwin-arm64 |
| Windows | pdf-mcp-server-windows-x64.exe | - |
From Source
cargo install --git https://github.com/paradyno/pdf-mcp-server
Configuration
Claude Desktop
Add to your claude_desktop_config.json:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
{
"mcpServers": {
"pdf": {
"command": "npx",
"args": ["@paradyno/pdf-mcp-server"]
}
}
}
Claude Code
claude mcp add pdf -- npx @paradyno/pdf-mcp-server
VS Code
{
"mcp.servers": {
"pdf": {
"command": "npx",
"args": ["@paradyno/pdf-mcp-server"]
}
}
}
Tools
Source Types
All tools accept PDF sources in multiple formats:
{ "path": "/documents/file.pdf" }
{ "base64": "JVBERi0xLjQK..." }
{ "url": "https://example.com/document.pdf" }
{ "cache_key": "abc123" }
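As a rough illustration, the four source shapes above can be built as plain dictionaries on the client side. The helper names below are hypothetical and not part of the server; only the keys `path`, `base64`, `url`, and `cache_key` come from the examples above.

```python
import base64
import json


def path_source(path: str) -> dict:
    """Source that points at a file on the server's filesystem."""
    return {"path": path}


def base64_source(data: bytes) -> dict:
    """Source that carries the PDF bytes inline, base64-encoded."""
    return {"base64": base64.b64encode(data).decode("ascii")}


def url_source(url: str) -> dict:
    """Source that the server downloads from a URL."""
    return {"url": url}


def cache_source(cache_key: str) -> dict:
    """Source that refers to a previously cached document."""
    return {"cache_key": cache_key}


# "%PDF-1.4\n" is a typical first line of a PDF file.
print(json.dumps(base64_source(b"%PDF-1.4\n")))
```

Inline base64 suits small files; for larger documents a `path` or `url` source avoids inflating the request.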
extract_text
Extract text content with LLM-optimized formatting (paragraph detection, multi-column reordering, watermark removal).
Example & Parameters
{
"sources": [{ "path": "/documents/report.pdf" }],
"pages": "1-10",
"include_metadata": true
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| sources | array | Yes | - | PDF sources |
| pages | string | No | all | Page selection (e.g., "1-5,10,15-20") |
| include_metadata | boolean | No | true | Include PDF metadata |
| include_images | boolean | No | false | Include extracted images (base64 PNG) |
| password | string | No | - | PDF password if encrypted |
| cache | boolean | No | false | Enable caching |
extract_outline
Extract PDF bookmarks / table of contents.
Example, Parameters & Response
{
"sources": [{ "path": "/documents/book.pdf" }]
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| sources | array | Yes | - | PDF sources |
| password | string | No | - | PDF password if encrypted |
| cache | boolean | No | false | Enable caching |
Response:
{
"results": [{
"source": "/documents/book.pdf",
"outline": [
{
"title": "Chapter 1: Introduction",
"page": 1,
"children": [
{ "title": "1.1 Background", "page": 3, "children": [] }
]
}
]
}]
}
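Because the outline is a nested tree, a client will often want to flatten it into an indented table of contents. The following is a minimal sketch that assumes the response shape shown above (`title`, `page`, `children`); `flatten_outline` is a hypothetical helper, not something the server provides.

```python
def flatten_outline(items, depth=0):
    """Yield (depth, title, page) for every outline entry, depth-first."""
    for item in items:
        yield depth, item["title"], item["page"]
        # Entries without children are treated as leaves.
        yield from flatten_outline(item.get("children", []), depth + 1)


response = {
    "results": [{
        "source": "/documents/book.pdf",
        "outline": [{
            "title": "Chapter 1: Introduction",
            "page": 1,
            "children": [{"title": "1.1 Background", "page": 3, "children": []}],
        }],
    }]
}

for result in response["results"]:
    for depth, title, page in flatten_outline(result["outline"]):
        print("  " * depth + f"{title} (p. {page})")
```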
extract_metadata
Extract PDF metadata (author, title, dates, etc.) without loading full content.
Example & Parameters
{
"sources": [{ "path": "/documents/report.pdf" }]
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| sources | array | Yes | - | PDF sources |
| password | string | No | - | PDF password if encrypted |
| cache | boolean | No | false | Enable caching |
extract_annotations
Extract highlights, comments, underlines, and other annotations.
Example & Parameters
{
"sources": [{ "path": "/documents/report.pdf" }],
"annotation_types": ["highlight", "text"],
"pages": "1-5"
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| sources | array | Yes | - | PDF sources |
| annotation_types | array | No | all | Filter by types (highlight, underline, text, etc.) |
| pages | string | No | all | Page selection |
| password | string | No | - | PDF password if encrypted |
| cache | boolean | No | false | Enable caching |
extract_links
Extract hyperlinks and internal page navigation links.
Example, Parameters & Response
{
"sources": [{ "path": "/documents/paper.pdf" }],
"pages": "1-10"
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| sources | array | Yes | - | PDF sources |
| pages | string | No | all | Page selection |
| password | string | No | - | PDF password if encrypted |
| cache | boolean | No | false | Enable caching |
Response:
{
"results": [{
"source": "/documents/paper.pdf",
"links": [
{ "page": 1, "url": "https://example.com", "text": "Click here" },
{ "page": 3, "dest_page": 10, "text": "See Chapter 5" }
],
"total_count": 2
}]
}
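The example response mixes two kinds of links: external ones with a `url` and internal ones with a `dest_page`. A client could separate them as sketched below. This assumes the two keys are the only distinguishing feature, which is inferred from the example rather than from a documented schema; `split_links` is a hypothetical helper.

```python
def split_links(links):
    """Separate external hyperlinks from internal page links."""
    external = [link for link in links if "url" in link]
    internal = [link for link in links if "dest_page" in link]
    return external, internal


links = [
    {"page": 1, "url": "https://example.com", "text": "Click here"},
    {"page": 3, "dest_page": 10, "text": "See Chapter 5"},
]
external, internal = split_links(links)
print(len(external), "external,", len(internal), "internal")
```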
extract_form_fields
Read form field names, types, current values, and properties from PDF forms.
Example, Parameters & Response
{
"sources": [{ "path": "/documents/form.pdf" }],
"pages": "1"
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| sources | array | Yes | - | PDF sources |
| pages | string | No | all | Page selection |
| password | string | No | - | PDF password if encrypted |
| cache | boolean | No | false | Enable caching |
Response:
{
"results": [{
"source": "/documents/form.pdf",
"fields": [
{
"page": 1,
"name": "full_name",
"field_type": "text",
"value": "John Doe",
"is_read_only": false,
"is_required": true,
"properties": { "is_multiline": false, "is_password": false }
},
{
"page": 1,
"name": "agree_terms",
"field_type": "checkbox",
"is_checked": true,
"is_read_only": false,
"is_required": false,
"properties": {}
}
],
"total_fields": 2
}]
}
convert_page_to_image
Render PDF pages as PNG images (base64). Enables Vision LLMs to understand visual layouts, charts, and diagrams.
Example, Parameters & Response
{
"sources": [{ "path": "/documents/chart.pdf" }],
"pages": "1-3",
"width": 1200
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| sources | array | Yes | - | PDF sources |
| pages | string | No | all | Page selection |
| width | integer | No | 1200 | Target width in pixels |
| height | integer | No | - | Target height in pixels |
| scale | float | No | - | Scale factor (overrides width/height) |
| password | string | No | - | PDF password if encrypted |
| cache | boolean | No | false | Enable caching |
Response:
{
"results": [{
"source": "/documents/chart.pdf",
"pages": [
{
"page": 1,
"width": 1200,
"height": 1553,
"data_base64": "iVBORw0KGgo...",
"mime_type": "image/png"
}
]
}]
}
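On the client side, each page entry can be decoded back into PNG bytes. The sketch below assumes the `data_base64` and `mime_type` fields shown in the response above; `decode_page_image` is a hypothetical helper, and the signature check is an extra sanity test rather than anything the server requires.

```python
import base64

# Every PNG file starts with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_page_image(page: dict) -> bytes:
    """Decode one rendered page and check that it looks like a PNG."""
    if page.get("mime_type") != "image/png":
        raise ValueError(f"unexpected mime type: {page.get('mime_type')!r}")
    data = base64.b64decode(page["data_base64"])
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    return data
```

The returned bytes can be written to a `.png` file or passed to a vision model as image input.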
search
Full-text search within PDFs with surrounding context.
Example & Parameters
{
"sources": [{ "path": "/documents/manual.pdf" }],
"query": "error handling",
"context_chars": 100
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| sources | array | Yes | - | PDF sources |
| query | string | Yes | - | Search query |
| case_sensitive | boolean | No | false | Case-sensitive search |
| max_results | integer | No | 100 | Maximum results to return |
| context_chars | integer | No | 50 | Characters of context around match |
| password | string | No | - | PDF password if encrypted |
| cache | boolean | No | false | Enable caching |
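To make the interplay of `case_sensitive`, `max_results`, and `context_chars` concrete, here is a small sketch of substring search with surrounding context over already-extracted text. It is not the server's implementation, and the result keys `offset` and `context` are hypothetical.

```python
def search_text(text, query, context_chars=50, case_sensitive=False, max_results=100):
    """Find occurrences of query and return each with surrounding context."""
    # Lowercasing is assumed to preserve string length, which holds for ASCII.
    haystack = text if case_sensitive else text.lower()
    needle = query if case_sensitive else query.lower()
    results = []
    if not needle:
        return results
    start = 0
    while len(results) < max_results:
        idx = haystack.find(needle, start)
        if idx == -1:
            break
        lo = max(0, idx - context_chars)
        hi = min(len(text), idx + len(needle) + context_chars)
        results.append({"offset": idx, "context": text[lo:hi]})
        start = idx + len(needle)  # continue after this match
    return results


sample = "Robust Error Handling matters. error handling again."
for hit in search_text(sample, "error handling", context_chars=5):
    print(hit)
```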
get_page_info
Get page dimensions, word/char counts, token estimates, and file sizes. Useful for planning LLM context usage.
Example, Parameters & Response
{
"sources": [{ "path": "/documents/report.pdf" }]
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| sources | array | Yes | - | PDF sources |
| password | string | No | - | PDF password if encrypted |
| cache | boolean | No | false | Enable caching |
| skip_file_sizes | boolean | No | false | Skip file size calculation (faster) |
Response:
{
"results": [{
"source": "/documents/report.pdf",
"pages": [{
"page": 1,
"width": 612.0, "height": 792.0,
"rotation": 0, "orientation": "portrait",
"char_count": 2500, "word_count": 450,
"estimated_token_count": 625,
"file_size": 102400
}],
"total_pages": 10,
"total_chars": 25000,
"total_words": 4500,
"total_estimated_token_count": 6250
}]
}
Note: Token counts are model-dependent approximations (~4 chars/token for Latin, ~2 tokens/char for CJK). Use as rough guidance only.
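The heuristic in the note can be sketched as follows. This assumes every character counts toward the estimate and uses a rough set of CJK code point ranges; both are assumptions for illustration, and the server's own counting may differ.

```python
import math

# Rough CJK ranges: Hiragana and Katakana, CJK Unified Ideographs, Hangul syllables.
CJK_RANGES = ((0x3040, 0x30FF), (0x4E00, 0x9FFF), (0xAC00, 0xD7AF))


def is_cjk(ch: str) -> bool:
    code = ord(ch)
    return any(lo <= code <= hi for lo, hi in CJK_RANGES)


def estimate_tokens(text: str) -> int:
    """Approximate tokens: ~4 chars per token for non-CJK, ~2 tokens per CJK char."""
    cjk = sum(1 for ch in text if is_cjk(ch))
    other = len(text) - cjk
    return math.ceil(other / 4) + 2 * cjk


# A 2500-character Latin page, as in the example response above.
print(estimate_tokens("a" * 2500))
```

With these rules a 2500-character Latin page comes out at 625 tokens, matching `estimated_token_count` in the example response.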
summarize_structure
One-call comprehensive overview of a PDF's structure. Helps LLMs decide how to process a document.
Example, Parameters & Response
{
"sources": [{ "path": "/documents/report.pdf" }]
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| sources | array | Yes | - | PDF sources |
| password | string | No | - | PDF password if encrypted |
| cache | boolean | No | false | Enable caching |
Response:
{
"results": [{
"source": "/documents/report.pdf",
"page_count": 25,
"file_size": 1048576,
"metadata": { "title": "Annual Report", "author": "Acme Corp" },
"has_outline": true,
"outline_items": 12,
"total_chars": 50000,
"total_words": 9000,
"total_estimated_tokens": 12500,
"pages": [
{ "page": 1, "width": 612.0, "height": 792.0, "char_count": 2000, "word_count": 360, "has_images": true, "has_links": false, "has_annotations": false }
],
"total_images": 5,
"total_links": 3,
"total_annotations": 2,
"has_form": false,
"form_field_count": 0,
"form_field_types": {},
"is_encrypted": false
}]
}
list_pdfs
Discover PDF files in a directory with optional filtering.
Example & Parameters
{
"directory": "/documents",
"recursive": true,
"pattern": "invoice*.pdf"
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| directory | string | Yes | - | Directory to search |
| recursive | boolean | No | false | Search subdirectories |
| pattern | string | No | - | Filename pattern (e.g., "report*.pdf") |
split_pdf
Extract specific pages from a PDF to create a new PDF.
Example, Parameters & Page Range Syntax
{
"source": { "path": "/documents/book.pdf" },
"pages": "1-10,15,20-z",
"output_path": "/output/excerpt.pdf"
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source | object | Yes | - | PDF source |
| pages | string | Yes | - | Page range (see syntax below) |
| output_path | string | No | - | Save output to file |
| password | string | No | - | PDF password if encrypted |
Page Range Syntax:
| Syntax | Description |
|---|---|
| 1-5 | Pages 1 through 5 |
| 1,3,5 | Specific pages |
| z | Last page |
| r1 | Last page (rN counts back from the end) |
| 5-z | Page 5 to end |
| z-1 | All pages reversed |
| 1-z:odd | Odd pages only |
| 1-z:even | Even pages only |
| 1-10,x5 | Pages 1-10 except page 5 |
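To show how such a range expands for a document of known length, here is a sketch that handles a subset of the syntax: single pages, `z`, `rN`, and forward or reversed ranges. The `:odd`, `:even`, and `x` modifiers are deliberately left out, and `expand_pages` is a hypothetical helper, not the parser the server uses.

```python
def resolve_page(token: str, total: int) -> int:
    """Turn one page token (number, z, or rN) into a 1-based page number."""
    if token == "z":
        page = total
    elif token.startswith("r"):
        page = total - int(token[1:]) + 1  # r1 is the last page
    else:
        page = int(token)
    if not 1 <= page <= total:
        raise ValueError(f"page {token!r} is out of range for {total} pages")
    return page


def expand_pages(spec: str, total: int) -> list[int]:
    """Expand a comma-separated page range into an ordered list of pages."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if ":" in part or part.startswith("x"):
            raise ValueError(f"modifier in {part!r} is not handled by this sketch")
        if "-" in part:
            first, last = (resolve_page(t, total) for t in part.split("-", 1))
            step = 1 if first <= last else -1  # z-1 walks backwards
            pages.extend(range(first, last + step, step))
        else:
            pages.append(resolve_page(part, total))
    return pages


print(expand_pages("1-3,5,8-z", total=10))
```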
merge_pdfs
Merge multiple PDFs into a single file.
Example & Parameters
{
"sources": [
{ "path": "/documents/chapter1.pdf" },
{ "path": "/documents/chapter2.pdf" }
],
"output_path": "/output/complete-book.pdf"
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| sources | array | Yes | - | PDF sources to merge (in order) |
| output_path | string | No | - | Save output to file |
compress_pdf
Reduce PDF file size using stream optimization, object deduplication, and compression.
Example, Parameters & Response
{
"source": { "path": "/documents/large-report.pdf" },
"compression_level": 9,
"output_path": "/output/compressed.pdf"
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source | object | Yes | - | PDF source |
| object_streams | string | No | "generate" | "generate" (best) · "preserve" · "disable" |
| compression_level | integer | No | 9 | 1-9 (higher = better compression) |
| output_path | string | No | - | Save output to file |
| password | string | No | - | PDF password if encrypted |
Response:
{
"results": [{
"source": "/documents/large-report.pdf",
"original_size": 5242880,
"compressed_size": 2097152,
"compression_ratio": 0.4,
"bytes_saved": 3145728
}]
}
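The numbers in this response are consistent with `compression_ratio` being the compressed size divided by the original size. That reading is inferred from the example, not from a documented definition, so the sketch below should be treated as an assumption.

```python
def compression_stats(original_size: int, compressed_size: int) -> dict:
    """Derive the summary figures from the two file sizes (in bytes)."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return {
        # Assumed definition: fraction of the original size that remains.
        "compression_ratio": compressed_size / original_size,
        "bytes_saved": original_size - compressed_size,
    }


print(compression_stats(5242880, 2097152))
```

Under this definition a lower ratio means a smaller output: 0.4 here corresponds to a 60% reduction.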
fill_form
Write values into existing PDF form fields and produce a new PDF.
Example, Parameters & Limitations
{
"source": { "path": "/documents/form.pdf" },
"field_values": [
{ "name": "full_name", "value": "Jane Smith" },
{ "name": "agree_terms", "checked": true }
],
"output_path": "/output/filled-form.pdf"
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source | object | Yes | - | PDF source |
| field_values | array | Yes | - | Fields to fill (see below) |
| output_path | string | No | - | Save output to file |
| password | string | No | - | PDF password if encrypted |
Field value format:
| Field | Type | Description |
|---|---|---|
| name | string | Field name (use extract_form_fields to discover names) |
| value | string | Text value (for text fields) |
| checked | boolean | Checked state (for checkbox/radio fields) |
Supported field types: Text fields, checkboxes, radio buttons. ComboBox/ListBox selection is read-only.
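A client that holds answers in a simple mapping can convert them into the `field_values` array described above. In this sketch booleans become `checked` entries and everything else becomes a text `value`; `to_field_values` is a hypothetical helper, and mapping by Python type is an assumption that suits only text fields and checkboxes.

```python
def to_field_values(answers: dict) -> list[dict]:
    """Convert {field name: answer} into the field_values array for fill_form."""
    field_values = []
    for name, answer in answers.items():
        if isinstance(answer, bool):
            # Checkboxes and radio buttons take a checked state.
            field_values.append({"name": name, "checked": answer})
        else:
            # Text fields take a string value.
            field_values.append({"name": name, "value": str(answer)})
    return field_values


print(to_field_values({"full_name": "Jane Smith", "agree_terms": True}))
```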
protect_pdf
Add password protection using 256-bit AES encryption.
Example & Parameters
{
"source": { "path": "/documents/confidential.pdf" },
"user_password": "secret123",
"allow_print": "none",
"allow_copy": false
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source | object | Yes | - | PDF source |
| user_password | string | Yes | - | Password to open the PDF |
| owner_password | string | No | user_password | Password to change permissions |
| allow_print | string | No | "full" | "full" · "low" · "none" |
| allow_copy | boolean | No | true | Allow copying text/images |
| allow_modify | boolean | No | true | Allow modifying the document |
| output_path | string | No | - | Save output to file |
| password | string | No | - | Password for source PDF if encrypted |
unprotect_pdf
Remove password protection from an encrypted PDF.
Example & Parameters
{
"source": { "path": "/documents/protected.pdf" },
"password": "secret123",
"output_path": "/output/unprotected.pdf"
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source | object | Yes | - | PDF source |
| password | string | Yes | - | Password for the encrypted PDF |
| output_path | string | No | - | Save output to file |
MCP Resources
Expose PDFs from configured directories as MCP Resources for direct client discovery and reading.
Configuration & Details
Enabling Resources
# Command line
pdf-mcp-server --resource-dir /documents --resource-dir /data/pdfs
# Short form
pdf-mcp-server -r /documents -r /data/pdfs
# Environment variable (path-separated: colon on Unix, semicolon on Windows)
PDF_RESOURCE_DIRS=/documents:/data/pdfs pdf-mcp-server
Claude Desktop with resources:
{
"mcpServers": {
"pdf": {
"command": "npx",
"args": ["@paradyno/pdf-mcp-server", "--resource-dir", "/documents"],
"env": {
"PDF_RESOURCE_DIRS": "/data/pdfs:/shared/documents"
}
}
}
}
Both methods can be combined: command-line arguments are added to the environment variable paths.
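One way to picture the combination is a list that starts with the directories from `PDF_RESOURCE_DIRS` and then appends those given on the command line. The ordering and the de-duplication in this sketch are assumptions, and `resource_dirs` is a hypothetical helper rather than the server's code.

```python
import os


def resource_dirs(env_value, cli_dirs, sep=os.pathsep):
    """Merge directories from the environment variable and the command line."""
    # os.pathsep is ":" on Unix and ";" on Windows.
    dirs = [d for d in (env_value or "").split(sep) if d]
    for d in cli_dirs:
        if d not in dirs:  # skip duplicates (assumed behaviour)
            dirs.append(d)
    return dirs


print(resource_dirs("/data/pdfs:/shared/documents", ["/documents"], sep=":"))
```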
Resource URIs
PDFs are exposed with file:// URIs:
file:///documents/report.pdf
file:///documents/2024/invoice.pdf
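For clients that need to predict a resource URI from a path, the mapping can be sketched as below. Percent-encoding of special characters is an assumption here; the examples above only show paths that need no escaping. `pdf_uri` is a hypothetical helper.

```python
import posixpath
from urllib.parse import quote


def pdf_uri(path: str) -> str:
    """Build a file:// URI from an absolute POSIX path."""
    if not posixpath.isabs(path):
        raise ValueError("a file:// URI needs an absolute path")
    # quote() leaves "/" untouched and escapes characters such as spaces.
    return "file://" + quote(path)


print(pdf_uri("/documents/report.pdf"))
```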
Operations
- resources/list: Returns all PDFs with URI, name, MIME type, size, and description
- resources/read: Returns extracted text content, formatted for LLM consumption
Resources vs Tools vs Caching
| Feature | Purpose | Use Case |
|---|---|---|
| Resources | Passive file discovery | Browse and preview available PDFs |
| Tools | Active PDF processing | Extract, search, manipulate PDFs |
| CacheRef | Tool chaining | Pass output between operations |
Security
Configuration & Details
Path Sandboxing
When --resource-dir is specified, all file operations (reads and writes) are sandboxed to the configured directories. Path traversal attempts (e.g., ../../etc/passwd) are blocked via canonicalize().
Without --resource-dir, all paths are allowed (backward compatible).
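The idea behind canonicalization-based sandboxing can be sketched in a few lines: resolve the requested path first, then check that the result still lies under an allowed directory. The server does this in Rust via canonicalize(); the Python version below is only an analogy, and `is_inside_sandbox` is a hypothetical name.

```python
import os


def is_inside_sandbox(candidate: str, allowed_dirs: list[str]) -> bool:
    """Return True if candidate resolves to a location inside an allowed directory."""
    if not allowed_dirs:
        # No directories configured: all paths are allowed, as described above.
        return True
    # realpath resolves ".." segments and symlinks before the comparison.
    real = os.path.realpath(candidate)
    for root in allowed_dirs:
        real_root = os.path.realpath(root)
        try:
            if os.path.commonpath([real, real_root]) == real_root:
                return True
        except ValueError:
            # Raised on Windows when the paths are on different drives.
            continue
    return False


print(is_inside_sandbox("/documents/../etc/passwd", ["/documents"]))
```

Comparing resolved paths rather than raw strings is what defeats `..` segments and symlinks that point outside the sandbox.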
SSRF Protection
URL sources are checked for SSRF by default. URLs that resolve to private or reserved IP addresses are blocked:
- Loopback (127.0.0.0/8, ::1)
- Private (10/8, 172.16/12, 192.168/16, fc00::/7)
- Link-local (169.254/16, fe80::/10): blocks cloud metadata endpoints
- CGNAT (100.64/10)
- Broadcast, unspecified
Use --allow-private-urls or PDF_ALLOW_PRIVATE_URLS=1 to disable this check (e.g., for local development).
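The blocked ranges listed above can be expressed as a simple address check. This sketch only classifies an IP address that has already been resolved; it does no DNS lookup and does not handle IPv4-mapped IPv6 addresses, so it is an illustration of the rule set rather than a complete SSRF defence. `is_blocked_address` is a hypothetical name.

```python
import ipaddress

BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "127.0.0.0/8", "::1/128",                                      # loopback
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "fc00::/7",  # private
    "169.254.0.0/16", "fe80::/10",                                 # link-local
    "100.64.0.0/10",                                               # CGNAT
    "255.255.255.255/32",                                          # broadcast
    "0.0.0.0/32", "::/128",                                        # unspecified
)]


def is_blocked_address(address: str) -> bool:
    """Return True if the address falls in one of the blocked ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip.version == net.version and ip in net for net in BLOCKED_NETWORKS)


# 169.254.169.254 is the metadata endpoint used by several cloud providers.
print(is_blocked_address("169.254.169.254"))
```

A real check has to apply this to every address a hostname resolves to, and again after redirects.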
Resource Limits
| Limit | Default | CLI Flag | Env Var |
|---|---|---|---|
| URL download size | 100 MB | --max-download-size | PDF_MAX_DOWNLOAD_BYTES |
| Cache total bytes | 512 MB | --cache-max-bytes | PDF_CACHE_MAX_BYTES |
| Cache entries | 100 | --cache-max-entries | PDF_CACHE_MAX_ENTRIES |
| Image scale factor | 10.0 | --max-image-scale | PDF_MAX_IMAGE_SCALE |
| Image pixel area | 100M | --max-image-pixels | PDF_MAX_IMAGE_PIXELS |
Caching
When cache: true is specified, the server returns a cache_key for use in subsequent requests:
// Step 1: Extract with caching
{ "sources": [{ "path": "/documents/large.pdf" }], "cache": true }
// Step 2: Use cache_key from response
{ "sources": [{ "cache_key": "a1b2c3d4" }], "pages": "50-60" }
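The cache behind these keys is described as an LRU with both an entry limit and a byte budget. The following sketch shows how those two limits can interact; it is a generic illustration, not the server's Rust implementation. The class name is hypothetical, and the defaults mirror the Resource Limits table under the assumption that 512 MB means 512 × 1024 × 1024 bytes.

```python
from collections import OrderedDict


class ByteBudgetLRU:
    """LRU cache bounded by entry count and by total stored bytes."""

    def __init__(self, max_entries=100, max_bytes=512 * 1024 * 1024):
        self.max_entries = max_entries
        self.max_bytes = max_bytes
        self.total_bytes = 0
        self._items = OrderedDict()  # oldest entry first

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value: bytes):
        if key in self._items:
            self.total_bytes -= len(self._items.pop(key))
        if len(value) > self.max_bytes:
            return  # larger than the whole budget: do not cache
        self._items[key] = value
        self.total_bytes += len(value)
        # Evict least recently used entries until both limits hold.
        while len(self._items) > self.max_entries or self.total_bytes > self.max_bytes:
            _, evicted = self._items.popitem(last=False)
            self.total_bytes -= len(evicted)
```

A `cache_key` that has been evicted can no longer be resolved, so clients should be prepared to resend the original source.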
Architecture
block-beta
columns 1
block:server["MCP Server (rmcp)"]
columns 3
extract_text search split_pdf
end
block:common["Common Layer"]
columns 3
Cache["Cache Manager"] Source["Source Resolver"] Batch["Batch Executor"]
end
block:pdf["PDF Processing"]
columns 2
PDFium["pdfium-render\n(reading)"] qpdf["qpdf FFI\n(manipulation)"]
end
server --> common --> pdf
Performance
Benchmarked with a 14-page technical paper (tracemonkey.pdf, ~1 MB) on Docker (Apple Silicon):
| Operation | Time | What it means |
|---|---|---|
| Extract text (14 pages) | 170 ms | ~80 pages per second (about 350 documents per minute) |
| Metadata only | 0.26 ms | ~4,000 documents per second |
| Search | 0.01 ms | Instant results on extracted text |
| 100 files batch | 4.8 s | ~21 documents per second |
Key takeaways
- Fast enough for interactive use: Text extraction completes in under 200 ms
- Metadata is nearly instant: Use extract_metadata or summarize_structure to quickly assess documents before full processing
- Search is blazing fast: Once text is extracted, searching is essentially free
- Batch processing scales linearly: No significant overhead when processing many files
Run benchmarks yourself:
docker compose --profile dev run --rm bench
Development
Docker (Recommended)
# Build
docker compose --profile dev run --rm dev cargo build
# Run tests
docker compose --profile dev run --rm test
# Run tests with coverage
docker compose --profile dev run --rm coverage
# Format code
docker compose --profile dev run --rm dev cargo fmt --all
# Lint
docker compose --profile dev run --rm clippy
# Performance benchmarks
docker compose --profile dev run --rm bench
# Build production image (~120MB)
docker compose --profile prod build production
# Clean up
docker compose --profile dev down --rmi local
Native Development
Requires PDFium installed locally. Download from pdfium-binaries and set PDFIUM_PATH.
cargo build --release
cargo test
cargo bench
cargo llvm-cov --html
Project Structure
src/
├── main.rs              # Entry point, CLI args
├── lib.rs               # Library root
├── server.rs            # MCP server & tool handlers
├── error.rs             # Error types
├── pdf/
│   ├── reader.rs        # PDFium wrapper (text, metadata, outline)
│   ├── annotations.rs   # Annotation extraction
│   ├── images.rs        # Image extraction
│   └── qpdf.rs          # qpdf FFI (split, merge, encrypt)
└── source/
    ├── resolver.rs      # Path/URL/Base64 resolution
    └── cache.rs         # LRU caching layer
Roadmap
Completed Phases
Phase 1: Core Reading
extract_text · extract_outline · search · extract_metadata · extract_annotations · Image extraction · Batch processing · Caching
Phase 2: PDF Manipulation
split_pdf · merge_pdfs · protect_pdf · unprotect_pdf · compress_pdf · extract_links · get_page_info
Phase 2.5: LLM-Optimized Text
Dynamic thresholds · Paragraph detection · Multi-column layout · Watermark removal
Phase 2.6: Discovery & Resources
list_pdfs · MCP Resources · Resource directory configuration
Phase 2.7: Vision & Forms
convert_page_to_image · extract_form_fields · fill_form · summarize_structure
Phase 3: Advanced Features (Planned)
- rotate_pages: Rotate specific pages
- extract_tables: Structured table extraction
- add_watermark: Text/image watermarks
- linearize_pdf: Web optimization
- OCR support · PDF/A validation · Digital signature verification
Waiting for MCP Protocol
- Large file upload: MCP lacks a standard API for uploading large files (>20MB). Discussed in #1197, #1220, #1659.
- Chunked file transfer: No standard mechanism exists yet.
Current workarounds: shared filesystem (path), object storage with pre-signed URLs (url), or base64 encoding.
Deferred Features
These provide limited value for LLM use cases:
- Hyphenation merging: LLMs understand hyphenated words
- Fixed-pitch mode: Limited use cases
- Bounding box output: LLMs don't need coordinates
- Invisible text removal: Not supported by pdfium-render API
License
Apache License 2.0
Acknowledgments
- PDFium: PDF rendering engine (Apache 2.0)
- pdfium-render: Rust PDFium bindings (Apache 2.0)
- qpdf: PDF transformation library, vendored via FFI (Apache 2.0)
- rmcp: Rust MCP SDK
