Doc Indexer MCP
Local document indexer MCP server for semantic search over PDF, Excel, SQL, Markdown, and HTML files using Qdrant and Voyage AI embeddings.
doc-indexer-mcp
A local document indexer MCP (Model Context Protocol) server written in Rust. Enables semantic search over PDF, Excel, SQL/PL-SQL, Markdown, and HTML files using Qdrant vector database and Voyage AI embeddings. Designed for integration with Claude Code CLI and other MCP-compatible tools.
Features
- PDF Parsing: Uses pdftotext (poppler) for text extraction with full Unicode support
- Excel Parsing: Native Rust parsing via calamine (.xlsx, .xls, .xlsm, .ods)
- SQL/PL-SQL Parsing: Extracts procedures, functions, packages, and triggers
- Markdown Parsing: Section-aware chunking for documentation
- HTML Parsing: Extracts UI text from web application snapshots
- Vector Search: Qdrant vector database for semantic similarity search
- Embeddings: Voyage AI or OpenAI-compatible embeddings API
- MCP Protocol: Full MCP server implementation using rmcp 0.13
- Fully Configurable: All settings via environment variables
Prerequisites
- Rust 2024 Edition (rustc 1.85+)
- Qdrant vector database
- pdftotext (from poppler-utils) for PDF parsing
- Voyage AI API Key (or OpenAI-compatible endpoint)
Installing Dependencies
# macOS
brew install poppler
# Download Qdrant (macOS ARM64)
curl -LO https://github.com/qdrant/qdrant/releases/download/v1.14.0/qdrant-aarch64-apple-darwin.tar.gz
tar xzf qdrant-aarch64-apple-darwin.tar.gz
Configuration
All settings are configurable via environment variables. Copy .env.example to .env:
# Embedding API Configuration
VOYAGE_API_KEY=your-voyage-api-key
EMBEDDING_MODEL=voyage-3-large
# Vector Database Configuration
QDRANT_URL=http://localhost:6334
QDRANT_COLLECTION=doc_index
# Document Paths Configuration
DOCS_PATH=/path/to/your/documents
INDEX_SUBDIRS=docs
# Chunk Settings
PDF_CHUNK_SIZE=1000
PDF_CHUNK_OVERLAP=200
EXCEL_ROWS_PER_CHUNK=50
SQL_MAX_CHUNK_SIZE=4000
# Search Settings
SEARCH_TOP_K=10
# Logging
RUST_LOG=info
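As an illustration of how settings like these can be loaded, here is a minimal sketch using only the standard library. It is not the project's actual src/config.rs: the Config struct, the function names, and the choice to fall back to the example values above when a variable is unset are all assumptions made for this example.

```rust
use std::env;

/// Settings mirroring a subset of the variables listed above (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub qdrant_url: String,
    pub qdrant_collection: String,
    pub pdf_chunk_size: usize,
    pub pdf_chunk_overlap: usize,
    pub search_top_k: usize,
}

/// Builds a Config from any key -> value lookup, so the logic can be
/// exercised without touching the real process environment.
pub fn config_from<F>(lookup: F) -> Result<Config, String>
where
    F: Fn(&str) -> Option<String>,
{
    // String variable: fall back to the example value when unset (assumed default).
    let text = |key: &str, default: &str| lookup(key).unwrap_or_else(|| default.to_string());

    // Numeric variable: a value that is present but malformed is an error.
    let number = |key: &str, default: usize| -> Result<usize, String> {
        match lookup(key) {
            Some(raw) => raw
                .trim()
                .parse::<usize>()
                .map_err(|e| format!("{key} must be a non-negative integer: {e}")),
            None => Ok(default),
        }
    };

    let config = Config {
        qdrant_url: text("QDRANT_URL", "http://localhost:6334"),
        qdrant_collection: text("QDRANT_COLLECTION", "doc_index"),
        pdf_chunk_size: number("PDF_CHUNK_SIZE", 1000)?,
        pdf_chunk_overlap: number("PDF_CHUNK_OVERLAP", 200)?,
        search_top_k: number("SEARCH_TOP_K", 10)?,
    };

    // An overlap that is not smaller than the chunk size could never advance.
    if config.pdf_chunk_overlap >= config.pdf_chunk_size {
        return Err("PDF_CHUNK_OVERLAP must be smaller than PDF_CHUNK_SIZE".to_string());
    }
    Ok(config)
}

/// Convenience wrapper that reads the real process environment.
pub fn config_from_env() -> Result<Config, String> {
    config_from(|key| env::var(key).ok())
}
```

Passing the lookup as a closure keeps validation testable: a test can supply a fixed set of values, while the server would call config_from_env() once at startup.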
Chunk Size Recommendations
| Document Type | Language | Recommended Size |
|---|---|---|
| | Japanese | 600-800 chars |
| | English | 1000-1500 chars |
| Test Specifications | Any | 1200-1500 chars |
| SQL Code | Any | 4000 chars |
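To make the interaction between chunk size and overlap concrete, the following is a minimal sketch of fixed-size chunking measured in characters. It assumes that sizes are counted in Unicode characters rather than bytes, as the "chars" unit in the table suggests; the function name chunk_chars is hypothetical, and the real parsers also take section boundaries into account.

```rust
/// Splits `text` into chunks of at most `size` characters, where each chunk
/// repeats the last `overlap` characters of the previous one.
pub fn chunk_chars(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(size > 0, "chunk size must be positive");
    assert!(overlap < size, "overlap must be smaller than chunk size");

    // Work on chars, not bytes, so multi-byte text (for example Japanese)
    // is never cut in the middle of a character.
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            break; // the final chunk reached the end of the text
        }
        start += step;
    }
    chunks
}
```

With PDF_CHUNK_SIZE=1000 and PDF_CHUNK_OVERLAP=200, each new chunk starts 800 characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.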
Building
# Development build
cargo build
# Release build (optimized)
cargo build --release
Testing
# Run all tests
cargo test
# Run tests with output
cargo test -- --nocapture
# Run specific test module
cargo test parsers::pdf::tests
cargo test parsers::excel::tests
cargo test parsers::sql::tests
Running
- Start Qdrant:
./qdrant
- Run the MCP server:
cargo run --release
The server communicates via stdio following the MCP protocol.
MCP Tools
| Tool | Description |
|---|---|
| index_document | Index a single document file |
| index_directory | Recursively index all supported files in configured subdirectories |
| search_documents | Semantic search across indexed documents |
| delete_document | Remove a document from the index |
| get_stats | Get index statistics |
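MCP clients invoke tools like these by sending a JSON-RPC 2.0 request with the method tools/call over the server's stdin. The sketch below builds such a request with the standard library only. The tool name comes from the table above, but the argument name query is an assumption: this README does not document the tools' input schemas, so check the schema the server reports before relying on it.

```rust
/// Escapes the characters that must not appear raw inside a JSON string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds a single-line tools/call request.
/// The "query" argument name is hypothetical.
pub fn tools_call_request(id: u64, tool: &str, query: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{},"method":"tools/call","params":{{"name":"{}","arguments":{{"query":"{}"}}}}}}"#,
        id,
        json_escape(tool),
        json_escape(query)
    )
}
```

In practice Claude Code builds these messages for you; the sketch is mainly useful for understanding what travels over stdio when debugging the server by hand.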
Supported File Types
| Extension | Parser | Notes |
|---|---|---|
| .pdf | pdftotext | Full Unicode support |
| .xlsx, .xls, .xlsm, .ods | calamine | All sheets parsed |
| .sql, .pls, .pks, .pkb | SQL Parser | PL/SQL object extraction |
| .md, .markdown | Markdown Parser | Section-aware chunking |
| .html, .htm | HTML Parser | UI text extraction |
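The table above amounts to a lookup from file extension to parser. A minimal sketch of that dispatch follows; the ParserKind enum and parser_for function are hypothetical names, and matching extensions case-insensitively is an assumption rather than documented behavior.

```rust
use std::path::Path;

/// One variant per parser in the table above (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParserKind {
    Pdf,
    Excel,
    Sql,
    Markdown,
    Html,
}

/// Picks a parser from the file extension, or None for unsupported files.
pub fn parser_for(path: &str) -> Option<ParserKind> {
    // Lowercasing makes "REPORT.PDF" and "report.pdf" behave the same (assumption).
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some(ParserKind::Pdf),
        "xlsx" | "xls" | "xlsm" | "ods" => Some(ParserKind::Excel),
        "sql" | "pls" | "pks" | "pkb" => Some(ParserKind::Sql),
        "md" | "markdown" => Some(ParserKind::Markdown),
        "html" | "htm" => Some(ParserKind::Html),
        _ => None,
    }
}
```

Returning None for unknown extensions lets a recursive directory walk skip unsupported files instead of failing the whole indexing run.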
Integration with Claude Code CLI
Step 1: Build the server
cd /path/to/doc-indexer-mcp
cargo build --release
Step 2: Configure Claude Code CLI
Add the MCP server to your Claude Code configuration file ~/.claude.json:
{
  "mcpServers": {
    "doc-indexer": {
      "command": "/path/to/doc-indexer-mcp/target/release/doc-indexer-mcp",
      "env": {
        "VOYAGE_API_KEY": "your-voyage-api-key",
        "EMBEDDING_MODEL": "voyage-3-large",
        "QDRANT_URL": "http://localhost:6334",
        "QDRANT_COLLECTION": "doc_index",
        "DOCS_PATH": "/path/to/your/documents",
        "INDEX_SUBDIRS": "docs",
        "PDF_CHUNK_SIZE": "1000",
        "PDF_CHUNK_OVERLAP": "200",
        "RUST_LOG": "info"
      }
    }
  }
}
Step 3: Test with Claude Code
Use the /mcp command in Claude Code to test your MCP server:
claude
> /mcp
This will show all available MCP tools. You can then test individual tools:
> Search for "user authentication" in the indexed documents
> Index all documents in the docs folder
Step 4: Project-specific settings (optional)
Create .claude/settings.json in your project root for project-specific permissions:
{
  "permissions": {
    "allow": [
      "mcp__doc-indexer__index_document",
      "mcp__doc-indexer__index_directory",
      "mcp__doc-indexer__search_documents",
      "mcp__doc-indexer__get_stats",
      "mcp__doc-indexer__delete_document"
    ]
  }
}
Directory Structure for DOCS_PATH
Organize your documents in the configured subdirectories:
/your/docs/path/
├── docs/                  # Design documents, specifications
│   ├── design_spec.pdf
│   ├── test_spec.pdf
│   └── schema.md
└── sql/                   # SQL and PL/SQL files
    ├── procedures.sql
    └── packages.pkb
Architecture
src/
├── main.rs              # Entry point
├── config.rs            # Configuration from environment
├── embedding/
│   └── client.rs        # Embeddings API client (Voyage AI)
├── mcp/
│   ├── server.rs        # MCP server setup
│   └── tools.rs         # Tool implementations
├── parsers/
│   ├── mod.rs           # Parser trait and common types
│   ├── pdf.rs           # PDF parser (pdftotext)
│   ├── excel.rs         # Excel parser (calamine)
│   ├── sql.rs           # SQL/PL-SQL parser
│   ├── markdown.rs      # Markdown parser
│   └── html.rs          # HTML parser
└── vector_store/
    └── qdrant.rs        # Qdrant vector database client
Customizing Chunking Logic
Each parser in src/parsers/ implements intelligent chunking for its document type. You can customize the chunking behavior by modifying the section markers and patterns.
PDF Parser (src/parsers/pdf.rs)
The PDF parser uses section markers to split documents into logical chunks:
// Major section markers - customize for your document format
const MAJOR_SECTION_MARKERS: &[&str] = &[
    "【Initial Display】", "【On Display】", "【On Save】",
    // Add your own section markers here
];
// Sub-section headers
const SUB_SECTION_HEADERS: &[&str] = &[
    "Action Definition", "Screen Definition", "Error Check",
    // Add your own sub-section patterns
];
Key functions to customize:
- classify_line() - Determines line type (section header, content, etc.)
- should_start_new_block() - Decides chunk boundaries
- split_into_blocks() - Main chunking logic
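To show how these three functions could fit together, here is a simplified, self-contained sketch. The function names and marker lists are taken from this section, but the bodies, the signatures, and the LineType enum are guesses written for illustration; the real implementation in src/parsers/pdf.rs is likely more involved (for example, it may also enforce PDF_CHUNK_SIZE).

```rust
use std::mem;

// Marker lists copied from the snippet above; extend them for your documents.
const MAJOR_SECTION_MARKERS: &[&str] = &["【Initial Display】", "【On Display】", "【On Save】"];
const SUB_SECTION_HEADERS: &[&str] = &["Action Definition", "Screen Definition", "Error Check"];

/// Hypothetical line categories for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LineType {
    MajorSection,
    SubSection,
    Content,
    Blank,
}

/// Classifies a line by checking whether it starts with a known marker.
pub fn classify_line(line: &str) -> LineType {
    let trimmed = line.trim();
    if trimmed.is_empty() {
        LineType::Blank
    } else if MAJOR_SECTION_MARKERS.iter().any(|m| trimmed.starts_with(*m)) {
        LineType::MajorSection
    } else if SUB_SECTION_HEADERS.iter().any(|h| trimmed.starts_with(*h)) {
        LineType::SubSection
    } else {
        LineType::Content
    }
}

/// A header line closes the current block, unless that block is still empty.
pub fn should_start_new_block(kind: LineType, current_block: &str) -> bool {
    let is_header = kind == LineType::MajorSection || kind == LineType::SubSection;
    is_header && !current_block.is_empty()
}

/// Groups non-blank lines into blocks, each starting at a section header.
pub fn split_into_blocks(text: &str) -> Vec<String> {
    let mut blocks = Vec::new();
    let mut current = String::new();
    for line in text.lines() {
        let kind = classify_line(line);
        if kind == LineType::Blank {
            continue;
        }
        if should_start_new_block(kind, &current) {
            blocks.push(mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push('\n');
        }
        current.push_str(line.trim());
    }
    if !current.is_empty() {
        blocks.push(current);
    }
    blocks
}
```

In this sketch, adding a marker string to either list is enough to create a new chunk boundary, which is why the marker lists are the first thing to adapt to your own document templates.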
Excel Parser (src/parsers/excel.rs)
The Excel parser handles structured documents with tables and nested sections:
// Bracketed section markers
const MAJOR_SECTION_MARKERS: &[&str] = &[
    "【Initial Display】", "【Data Items】", "【Conditions】",
    // Add markers matching your Excel templates
];
// Row type classification
enum RowType {
    BracketedSection, // 【Section】
    MajorSection,     // 1. Section
    SubSection,       // 1.1. Sub Section
    TableHeader,      // No | Item Name | ...
    // Add custom row types
}
Key functions to customize:
- classify_row() - Classifies Excel rows by type
- should_start_new_block() - Determines chunk boundaries
- rows_to_markdown() - Converts rows to searchable text
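The row types in the enum above can be told apart with a few string checks on the first non-empty cell. The sketch below is one possible reading of the comments in that enum, not the project's code: the Data and Empty variants, the numbering_depth helper, and the rule that a first cell of "No" marks a table header are all assumptions.

```rust
/// Row categories from the enum above, plus Data and Empty (added for this sketch).
#[derive(Debug, PartialEq)]
pub enum RowType {
    BracketedSection, // 【Section】
    MajorSection,     // 1. Section
    SubSection,       // 1.1. Sub Section
    TableHeader,      // No | Item Name | ...
    Data,
    Empty,
}

/// Counts the dot-terminated numeric groups at the start of `cell`:
/// "1. Section" gives 1, "1.1. Sub Section" gives 2, "Name" gives 0.
fn numbering_depth(cell: &str) -> usize {
    let prefix = cell.split_whitespace().next().unwrap_or("");
    if !prefix.ends_with('.') {
        return 0;
    }
    let groups: Vec<&str> = prefix.trim_end_matches('.').split('.').collect();
    let numeric = groups
        .iter()
        .all(|g| !g.is_empty() && g.chars().all(|c| c.is_ascii_digit()));
    if numeric { groups.len() } else { 0 }
}

/// Classifies a row by inspecting its first non-empty cell.
pub fn classify_row(cells: &[&str]) -> RowType {
    let first = match cells.iter().map(|c| c.trim()).find(|c| !c.is_empty()) {
        Some(cell) => cell,
        None => return RowType::Empty,
    };
    if first.starts_with('【') && first.contains('】') {
        return RowType::BracketedSection;
    }
    match numbering_depth(first) {
        0 => {}
        1 => return RowType::MajorSection,
        _ => return RowType::SubSection,
    }
    // Assumed convention: tables start with a "No" column.
    if first == "No" || first == "No." {
        return RowType::TableHeader;
    }
    RowType::Data
}
```

Looking at the first non-empty cell rather than column A tolerates templates that indent section titles by leaving leading cells blank.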
HTML Parser (src/parsers/html.rs)
The HTML parser extracts UI text from web application snapshots:
// CSS class patterns to extract text from
let patterns = [
    ("title", "ui-dialog-title"),
    ("button", "a-Button-label"),
    ("column", "a-GV-headerLabel"),
    // Add patterns matching your UI framework
];
Key functions to customize:
- detect_component_type() - Identifies UI component types
- extract_texts() - Extracts text by CSS class patterns
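As a rough illustration of extracting text by CSS class, the sketch below scans the HTML as a plain string. This is deliberately naive and is not how src/parsers/html.rs necessarily works: it ignores nesting, HTML entities, and class names that happen to appear in body text. The extract_texts signature and the extract_ui_texts helper are assumptions for this example; the class names are the ones listed above.

```rust
// (component type, CSS class) pairs copied from the snippet above.
const PATTERNS: &[(&str, &str)] = &[
    ("title", "ui-dialog-title"),
    ("button", "a-Button-label"),
    ("column", "a-GV-headerLabel"),
];

/// Returns the text that directly follows each opening tag mentioning
/// `class_name`. Plain string scanning, with no real HTML parsing.
pub fn extract_texts(html: &str, class_name: &str) -> Vec<String> {
    let mut texts = Vec::new();
    let mut rest = html;
    while let Some(pos) = rest.find(class_name) {
        let after = &rest[pos + class_name.len()..];
        // Skip to the end of the opening tag, then read up to the next tag.
        match after.find('>') {
            Some(gt) => {
                let body = &after[gt + 1..];
                let end = body.find('<').unwrap_or(body.len());
                let text = body[..end].trim();
                if !text.is_empty() {
                    texts.push(text.to_string());
                }
                rest = body;
            }
            None => break,
        }
    }
    texts
}

/// Runs every pattern and labels each text with its component type.
pub fn extract_ui_texts(html: &str) -> Vec<(&'static str, String)> {
    let mut found = Vec::new();
    for &(kind, class_name) in PATTERNS.iter() {
        for text in extract_texts(html, class_name) {
            found.push((kind, text));
        }
    }
    found
}
```

Labeling each text with its component type means a search hit can report whether a phrase came from a dialog title, a button, or a column header. For production use, a proper HTML parser is a safer foundation than string scanning.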
SQL Parser (src/parsers/sql.rs)
The SQL parser extracts PL/SQL objects (procedures, functions, packages):
Key areas to customize:
- Object detection patterns for your database schema
- Package/procedure boundary detection
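As a starting point for object detection, the following sketch finds top-level CREATE statements for procedures, functions, packages, package bodies, and triggers. It is a simplification written for this README, not the logic in src/parsers/sql.rs: it only recognizes headers that fit on one line, it does not locate the end of each object, and the SqlObject struct and find_plsql_objects function are hypothetical names.

```rust
/// A detected PL/SQL object header (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct SqlObject {
    pub kind: String,
    pub name: String,
    pub line: usize, // 1-based line number of the CREATE statement
}

/// Scans for "CREATE [OR REPLACE] <kind> <name>" headers, one line at a time.
pub fn find_plsql_objects(source: &str) -> Vec<SqlObject> {
    let mut objects = Vec::new();
    for (index, line) in source.lines().enumerate() {
        let words: Vec<&str> = line.split_whitespace().collect();
        // Keywords are case-insensitive, so compare on an uppercased copy.
        let upper: Vec<String> = words.iter().map(|w| w.to_ascii_uppercase()).collect();
        if upper.first().map(String::as_str) != Some("CREATE") {
            continue;
        }
        // Skip the optional "OR REPLACE" clause.
        let mut i = 1;
        if upper.get(1).map(String::as_str) == Some("OR")
            && upper.get(2).map(String::as_str) == Some("REPLACE")
        {
            i = 3;
        }
        let kind = match upper.get(i).map(String::as_str) {
            Some("PROCEDURE") => "PROCEDURE",
            Some("FUNCTION") => "FUNCTION",
            Some("TRIGGER") => "TRIGGER",
            Some("PACKAGE") => {
                if upper.get(i + 1).map(String::as_str) == Some("BODY") {
                    i += 1;
                    "PACKAGE BODY"
                } else {
                    "PACKAGE"
                }
            }
            _ => continue, // tables, views, and so on are ignored here
        };
        if let Some(&raw) = words.get(i + 1) {
            // Drop a parameter list or trailing semicolon glued to the name.
            let name = raw.split('(').next().unwrap_or(raw).trim_end_matches(';');
            objects.push(SqlObject {
                kind: kind.to_string(),
                name: name.to_string(),
                line: index + 1,
            });
        }
    }
    objects
}
```

The recorded line numbers give natural chunk boundaries: each object can run from its own header to the line before the next one, subject to SQL_MAX_CHUNK_SIZE.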
Adding a New Parser
- Create a new file in src/parsers/ (e.g., xml.rs)
- Implement the DocumentParser trait:
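// Hypothetical parser type for this example
pub struct XmlParser;

#[async_trait::async_trait]
impl DocumentParser for XmlParser {
    async fn parse(&self, file_path: &str) -> Result<Vec<DocumentChunk>> {
        // Your parsing logic here: read file_path and return its chunks
        todo!("implement XML parsing")
    }

    fn supported_extensions(&self) -> Vec<&'static str> {
        vec!["xml"]
    }
}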
- Register in src/parsers/mod.rs
- Add to src/mcp/tools.rs in get_parser()
Troubleshooting
No logs visible in Claude Code CLI
The MCP server logs to stderr, which may not be visible in Claude Code CLI. To debug:
- Set RUST_LOG=debug in your configuration
- Run the server manually to see logs:
RUST_LOG=debug ./target/release/doc-indexer-mcp
Qdrant connection issues
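Ensure Qdrant is running and that its gRPC port matches QDRANT_URL (default: 6334). Qdrant's REST API listens on a separate port (6333 by default), so use that port for a quick curl check:
./qdrant
# Check: curl http://localhost:6333/collections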
PDF parsing errors
Ensure pdftotext is installed:
which pdftotext
# If not found: brew install poppler
Testing MCP connection
Use Claude Code's /mcp command to verify the server is connected:
claude
> /mcp
This will list all available MCP servers and their tools.
License
MIT License - see LICENSE file.
