Doc Indexer MCP

Local document indexer MCP server for semantic search over PDF, Excel, SQL, Markdown, and HTML files using Qdrant and Voyage AI embeddings.

doc-indexer-mcp

A local document indexer MCP (Model Context Protocol) server written in Rust. It enables semantic search over PDF, Excel, SQL/PL-SQL, Markdown, and HTML files using the Qdrant vector database and Voyage AI embeddings, and is designed for integration with Claude Code CLI and other MCP-compatible tools.

Features

PDF Parsing: Uses pdftotext (poppler) for text extraction with full Unicode support
Excel Parsing: Native Rust parsing via calamine (.xlsx, .xls, .xlsm, .ods)
SQL/PL-SQL Parsing: Extracts procedures, functions, packages, and triggers
Markdown Parsing: Section-aware chunking for documentation
HTML Parsing: Extracts UI text from web application snapshots
Vector Search: Qdrant vector database for semantic similarity search
Embeddings: Voyage AI or OpenAI-compatible embeddings API
MCP Protocol: Full MCP server implementation using rmcp 0.13
Fully Configurable: All settings via environment variables

Prerequisites

Rust 2024 Edition (rustc 1.85+)
Qdrant vector database
pdftotext (from poppler-utils) for PDF parsing
Voyage AI API Key (or OpenAI-compatible endpoint)

Installing Dependencies

# macOS
brew install poppler

# Download Qdrant (macOS ARM64)
curl -LO https://github.com/qdrant/qdrant/releases/download/v1.14.0/qdrant-aarch64-apple-darwin.tar.gz
tar xzf qdrant-aarch64-apple-darwin.tar.gz

Configuration

All settings are configurable via environment variables. Copy .env.example to .env:

# Embedding API Configuration
VOYAGE_API_KEY=your-voyage-api-key
EMBEDDING_MODEL=voyage-3-large

# Vector Database Configuration
QDRANT_URL=http://localhost:6334
QDRANT_COLLECTION=doc_index

# Document Paths Configuration
DOCS_PATH=/path/to/your/documents
INDEX_SUBDIRS=docs

# Chunk Settings
PDF_CHUNK_SIZE=1000
PDF_CHUNK_OVERLAP=200
EXCEL_ROWS_PER_CHUNK=50
SQL_MAX_CHUNK_SIZE=4000

# Search Settings
SEARCH_TOP_K=10

# Logging
RUST_LOG=info
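
As a rough illustration of how settings like these can be read, the sketch below loads a few of the variables with fallbacks. It is not the project's config.rs: the Settings struct, the load_settings function, and the choice to use the example values above as defaults are all assumptions made for this example.

```rust
/// Subset of the settings listed above. Field names are illustrative only.
#[derive(Debug, PartialEq)]
pub struct Settings {
    pub qdrant_url: String,
    pub collection: String,
    pub pdf_chunk_size: usize,
    pub pdf_chunk_overlap: usize,
    pub search_top_k: usize,
}

/// Build settings from any key/value lookup, so the same code works with
/// `std::env::var` at runtime and with an in-memory map in tests.
/// Defaults are assumed to match the example values in this README.
pub fn load_settings<F>(lookup: F) -> Result<Settings, String>
where
    F: Fn(&str) -> Option<String>,
{
    // Parse a numeric variable, falling back to `default` when it is unset.
    let number = |key: &str, default: usize| -> Result<usize, String> {
        match lookup(key) {
            Some(raw) => raw
                .trim()
                .parse::<usize>()
                .map_err(|e| format!("{key} must be a non-negative integer: {e}")),
            None => Ok(default),
        }
    };
    // Read a string variable, falling back to `default` when it is unset.
    let text = |key: &str, default: &str| lookup(key).unwrap_or_else(|| default.to_string());

    let settings = Settings {
        qdrant_url: text("QDRANT_URL", "http://localhost:6334"),
        collection: text("QDRANT_COLLECTION", "doc_index"),
        pdf_chunk_size: number("PDF_CHUNK_SIZE", 1000)?,
        pdf_chunk_overlap: number("PDF_CHUNK_OVERLAP", 200)?,
        search_top_k: number("SEARCH_TOP_K", 10)?,
    };

    // A sliding window only advances if the overlap is smaller than the chunk.
    if settings.pdf_chunk_overlap >= settings.pdf_chunk_size {
        return Err("PDF_CHUNK_OVERLAP must be smaller than PDF_CHUNK_SIZE".to_string());
    }
    Ok(settings)
}

/// Convenience wrapper that reads the real process environment.
pub fn load_settings_from_env() -> Result<Settings, String> {
    load_settings(|key| std::env::var(key).ok())
}
```

Taking the lookup as a closure keeps validation (for example, rejecting an overlap that is not smaller than the chunk size) testable without touching the process environment.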

Chunk Size Recommendations

Document Type	Language	Recommended Size
PDF	Japanese	600-800 chars
PDF	English	1000-1500 chars
Test Specifications	Any	1200-1500 chars
SQL Code	Any	4000 chars
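
To make the size and overlap settings concrete, here is a minimal sliding-window chunker that counts characters rather than bytes. It is a sketch, not the project's implementation: the function name chunk_chars is hypothetical, and whether the real parsers count characters or bytes is an assumption based on the "chars" unit in the table.

```rust
/// Split `text` into chunks of at most `size` characters, where each chunk
/// starts with the last `overlap` characters of the previous chunk.
/// Counting `char`s (not bytes) keeps multi-byte text such as Japanese
/// from being cut in the middle of a code point.
pub fn chunk_chars(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(size > 0 && overlap < size, "overlap must be smaller than size");
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap; // how far the window advances each time
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            break; // the final window reached the end of the text
        }
        start += step;
    }
    chunks
}
```

With PDF_CHUNK_SIZE=1000 and PDF_CHUNK_OVERLAP=200, a window like this advances 800 characters per chunk, so a sentence cut at one chunk boundary still appears whole in the next chunk as long as it is shorter than the overlap.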

Building

# Development build
cargo build

# Release build (optimized)
cargo build --release

Testing

# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run specific test module
cargo test parsers::pdf::tests
cargo test parsers::excel::tests
cargo test parsers::sql::tests

Running

Start Qdrant:

./qdrant

Run the MCP server:

cargo run --release

The server communicates via stdio following the MCP protocol.

MCP Tools

Tool	Description
`index_document`	Index a single document file
`index_directory`	Recursively index all supported files in configured subdirectories
`search_documents`	Semantic search across indexed documents
`delete_document`	Remove a document from the index
`get_stats`	Get index statistics
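
MCP clients invoke these tools with JSON-RPC 2.0 `tools/call` requests, sent over stdio as one JSON message per line after the usual `initialize` handshake. The sketch below builds such a request by hand to show its shape. The helper names are hypothetical, and the `query` argument name is an assumption; check the tool schema reported by the server for the real parameter names.

```rust
/// Escape a string so it can be embedded in a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a single-line JSON-RPC 2.0 `tools/call` request for the
/// `search_documents` tool. The `query` argument name is assumed.
pub fn search_request(id: u64, query: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{id},"method":"tools/call","params":{{"name":"search_documents","arguments":{{"query":"{q}"}}}}}}"#,
        id = id,
        q = json_escape(query)
    )
}
```

In practice Claude Code builds these messages for you; the request is shown only to clarify what travels over the server's stdin.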

Supported File Types

Extension	Parser	Notes
`.pdf`	pdftotext	Full Unicode support
`.xlsx`, `.xls`, `.xlsm`, `.ods`	calamine	All sheets parsed
`.sql`, `.pls`, `.pks`, `.pkb`	SQL Parser	PL/SQL object extraction
`.md`, `.markdown`	Markdown Parser	Section-aware chunking
`.html`, `.htm`	HTML Parser	UI text extraction
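
The table above amounts to a dispatch on file extension. A minimal sketch of that mapping follows; the ParserKind enum and parser_for function are hypothetical names, and matching extensions case-insensitively is an assumption rather than documented behavior.

```rust
use std::path::Path;

/// Parser families from the table above. The enum itself is illustrative.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ParserKind {
    Pdf,
    Excel,
    Sql,
    Markdown,
    Html,
}

/// Pick a parser family from the file extension, or `None` when the file
/// type is not supported (or the path has no extension).
pub fn parser_for(path: &str) -> Option<ParserKind> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some(ParserKind::Pdf),
        "xlsx" | "xls" | "xlsm" | "ods" => Some(ParserKind::Excel),
        "sql" | "pls" | "pks" | "pkb" => Some(ParserKind::Sql),
        "md" | "markdown" => Some(ParserKind::Markdown),
        "html" | "htm" => Some(ParserKind::Html),
        _ => None,
    }
}
```

Returning `Option` lets a directory walk skip unsupported files quietly instead of treating them as errors.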

Integration with Claude Code CLI

Step 1: Build the server

cd /path/to/doc-indexer-mcp
cargo build --release

Step 2: Configure Claude Code CLI

Add the MCP server to your Claude Code configuration file ~/.claude.json:

{
  "mcpServers": {
    "doc-indexer": {
      "command": "/path/to/doc-indexer-mcp/target/release/doc-indexer-mcp",
      "env": {
        "VOYAGE_API_KEY": "your-voyage-api-key",
        "EMBEDDING_MODEL": "voyage-3-large",
        "QDRANT_URL": "http://localhost:6334",
        "QDRANT_COLLECTION": "doc_index",
        "DOCS_PATH": "/path/to/your/documents",
        "INDEX_SUBDIRS": "docs",
        "PDF_CHUNK_SIZE": "1000",
        "PDF_CHUNK_OVERLAP": "200",
        "RUST_LOG": "info"
      }
    }
  }
}

Step 3: Test with Claude Code

Use the /mcp command in Claude Code to test your MCP server:

claude
> /mcp

This will show all available MCP tools. You can then test individual tools:

> Search for "user authentication" in the indexed documents
> Index all documents in the docs folder

Step 4: Project-specific settings (optional)

Create .claude/settings.json in your project root to pre-approve the server's tools for that project:

{
  "permissions": {
    "allow": [
      "mcp__doc-indexer__index_document",
      "mcp__doc-indexer__index_directory",
      "mcp__doc-indexer__search_documents",
      "mcp__doc-indexer__get_stats",
      "mcp__doc-indexer__delete_document"
    ]
  }
}

Directory Structure for DOCS_PATH

Organize your documents in the configured subdirectories:

/your/docs/path/
├── docs/                    # Design documents, specifications
│   ├── design_spec.pdf
│   ├── test_spec.pdf
│   └── schema.md
└── sql/                     # SQL and PL/SQL files
    ├── procedures.sql
    └── packages.pkb

Architecture

src/
├── main.rs              # Entry point
├── config.rs            # Configuration from environment
├── embedding/
│   └── client.rs        # Embeddings API client (Voyage AI)
├── mcp/
│   ├── server.rs        # MCP server setup
│   └── tools.rs         # Tool implementations
├── parsers/
│   ├── mod.rs           # Parser trait and common types
│   ├── pdf.rs           # PDF parser (pdftotext)
│   ├── excel.rs         # Excel parser (calamine)
│   ├── sql.rs           # SQL/PL-SQL parser
│   ├── markdown.rs      # Markdown parser
│   └── html.rs          # HTML parser
└── vector_store/
    └── qdrant.rs        # Qdrant vector database client

Customizing Chunking Logic

Each parser in src/parsers/ chunks text according to the structure of its document type. To change where chunks begin and end, edit the section markers and patterns described below, then rebuild the server.

PDF Parser (`src/parsers/pdf.rs`)

The PDF parser uses section markers to split documents into logical chunks:

// Major section markers - customize for your document format
const MAJOR_SECTION_MARKERS: &[&str] = &[
    "【Initial Display】", "【On Display】", "【On Save】",
    // Add your own section markers here
];

// Sub-section headers
const SUB_SECTION_HEADERS: &[&str] = &[
    "Action Definition", "Screen Definition", "Error Check",
    // Add your own sub-section patterns
];

Key functions to customize:

classify_line() - Determines line type (section header, content, etc.)
should_start_new_block() - Decides chunk boundaries
split_into_blocks() - Main chunking logic
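
To show the idea behind marker-based splitting, here is a simplified stand-in that starts a new block whenever a line begins with a major section marker. It is not the code in pdf.rs: the function name split_on_markers is hypothetical, and the real parser also handles sub-section headers and chunk size limits.

```rust
/// Example markers, copied from the constant shown above.
const MAJOR_SECTION_MARKERS: &[&str] = &["【Initial Display】", "【On Display】", "【On Save】"];

/// Split extracted text into blocks, starting a new block at every line
/// that begins with one of the major section markers.
pub fn split_on_markers(text: &str) -> Vec<String> {
    let mut blocks: Vec<String> = Vec::new();
    let mut current = String::new();
    for line in text.lines() {
        let is_marker = MAJOR_SECTION_MARKERS
            .iter()
            .any(|m| line.trim_start().starts_with(*m));
        // Close the running block before the marker line opens a new one.
        if is_marker && !current.trim().is_empty() {
            blocks.push(current.trim_end().to_string());
            current = String::new();
        }
        current.push_str(line);
        current.push('\n');
    }
    if !current.trim().is_empty() {
        blocks.push(current.trim_end().to_string());
    }
    blocks
}
```

Text that appears before the first marker becomes its own block, so document preambles are not lost.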

Excel Parser (`src/parsers/excel.rs`)

The Excel parser handles structured documents with tables and nested sections:

// Bracketed section markers
const MAJOR_SECTION_MARKERS: &[&str] = &[
    "【Initial Display】", "【Data Items】", "【Conditions】",
    // Add markers matching your Excel templates
];

// Row type classification
enum RowType {
    BracketedSection,    // 【Section】
    MajorSection,        // 1. Section
    SubSection,          // 1.1. Sub Section
    TableHeader,         // No | Item Name | ...
    // Add custom row types
}

Key functions to customize:

classify_row() - Classifies Excel rows by type
should_start_new_block() - Determines chunk boundaries
rows_to_markdown() - Converts rows to searchable text
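
As an illustration of turning spreadsheet rows into searchable text, the sketch below renders rows as Markdown tables, grouped by a rows-per-chunk limit like EXCEL_ROWS_PER_CHUNK. The function names are hypothetical, and repeating the header in every chunk is a design choice made for this example, not confirmed behavior of excel.rs.

```rust
/// Render one row as a Markdown table line, padded or truncated to `width`
/// cells. Pipes and newlines inside cells are neutralized.
fn render_row(cells: &[String], width: usize) -> String {
    let cells: Vec<String> = (0..width)
        .map(|i| {
            cells
                .get(i)
                .map(|c| c.replace('|', "\\|").replace('\n', " "))
                .unwrap_or_default()
        })
        .collect();
    format!("| {} |", cells.join(" | "))
}

/// Group `rows` into chunks of at most `rows_per_chunk` rows and render each
/// chunk as a Markdown table that repeats the header row.
pub fn rows_to_markdown_chunks(
    header: &[String],
    rows: &[Vec<String>],
    rows_per_chunk: usize,
) -> Vec<String> {
    assert!(rows_per_chunk > 0, "rows_per_chunk must be positive");
    let width = header.len();
    let separator = format!("|{}|", vec![" --- "; width].join("|"));
    rows.chunks(rows_per_chunk)
        .map(|group| {
            let mut lines = vec![render_row(header, width), separator.clone()];
            lines.extend(group.iter().map(|row| render_row(row, width)));
            lines.join("\n")
        })
        .collect()
}
```

Repeating the header keeps each chunk self-describing, so a search hit on a late row still carries its column names.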

HTML Parser (`src/parsers/html.rs`)

The HTML parser extracts UI text from web application snapshots:

// CSS class patterns to extract text from
let patterns = [
    ("title", "ui-dialog-title"),
    ("button", "a-Button-label"),
    ("column", "a-GV-headerLabel"),
    // Add patterns matching your UI framework
];

Key functions to customize:

detect_component_type() - Identifies UI component types
extract_texts() - Extracts text by CSS class patterns
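
The following sketch shows one naive way to pull text out by CSS class using plain string scanning. It is only meant to convey the idea: the function names are hypothetical, it reads just the text directly after a matching opening tag, and it ignores nesting, entities, and single-quoted attributes that a real HTML parser would handle.

```rust
/// True when the tag's double-quoted `class` attribute lists `class_name`.
fn has_class(tag: &str, class_name: &str) -> bool {
    match tag.find("class=\"") {
        Some(pos) => {
            let value = &tag[pos + 7..]; // skip past `class="`
            let end = value.find('"').unwrap_or(value.len());
            value[..end].split_whitespace().any(|c| c == class_name)
        }
        None => false,
    }
}

/// Collect the text that directly follows each opening tag carrying
/// `class_name`, up to the next `<`.
pub fn extract_texts_by_class(html: &str, class_name: &str) -> Vec<String> {
    let mut found = Vec::new();
    let mut rest = html;
    while let Some(lt) = rest.find('<') {
        let after_lt = &rest[lt + 1..];
        let gt = match after_lt.find('>') {
            Some(gt) => gt,
            None => break, // unterminated tag: stop scanning
        };
        let tag = &after_lt[..gt];
        let body = &after_lt[gt + 1..];
        if has_class(tag, class_name) {
            let text_end = body.find('<').unwrap_or(body.len());
            let text = body[..text_end].trim();
            if !text.is_empty() {
                found.push(text.to_string());
            }
        }
        rest = body;
    }
    found
}
```

Calling it once per class in the patterns list (for example "ui-dialog-title" or "a-Button-label") yields the labels grouped by component type.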

SQL Parser (`src/parsers/sql.rs`)

The SQL parser extracts PL/SQL objects (procedures, functions, packages, and triggers).

Key areas to customize:

Object detection patterns for your database schema
Package/procedure boundary detection
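
A minimal sketch of object detection is shown below: it scans for lines that start with CREATE [OR REPLACE] followed by an object keyword. The SqlObject struct and find_objects function are hypothetical, and the sketch deliberately ignores multi-line headers, comments, and procedures declared inside a package body, all of which a real parser has to handle.

```rust
/// A detected top-level PL/SQL object. Illustrative type only.
#[derive(Debug, PartialEq)]
pub struct SqlObject {
    pub kind: String,
    pub name: String,
    pub line: usize, // 1-based line of the CREATE statement
}

/// Find `CREATE [OR REPLACE] PROCEDURE|FUNCTION|PACKAGE [BODY]|TRIGGER name`
/// headers that fit on a single line.
pub fn find_objects(source: &str) -> Vec<SqlObject> {
    let mut objects = Vec::new();
    for (idx, line) in source.lines().enumerate() {
        let words: Vec<&str> = line.split_whitespace().collect();
        // Case-insensitive keyword check at word position `i`.
        let is = |i: usize, kw: &str| words.get(i).map_or(false, |w| w.eq_ignore_ascii_case(kw));
        if !is(0, "CREATE") {
            continue;
        }
        let mut i = 1;
        if is(i, "OR") && is(i + 1, "REPLACE") {
            i += 2;
        }
        let mut kind = match words.get(i) {
            Some(w) => w.to_ascii_uppercase(),
            None => continue,
        };
        if !matches!(kind.as_str(), "PROCEDURE" | "FUNCTION" | "PACKAGE" | "TRIGGER") {
            continue;
        }
        i += 1;
        if kind == "PACKAGE" && is(i, "BODY") {
            kind.push_str(" BODY");
            i += 1;
        }
        if let Some(raw) = words.get(i) {
            // Cut the name at a parameter list or statement terminator.
            let name = raw.split(|c: char| c == '(' || c == ';').next().unwrap_or("");
            if !name.is_empty() {
                objects.push(SqlObject { kind, name: name.to_string(), line: idx + 1 });
            }
        }
    }
    objects
}
```

The recorded line numbers can serve as chunk boundaries: each object's chunk would run from its own line to the line before the next detected object.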

Adding a New Parser

Create a new file in src/parsers/ (e.g., xml.rs)
Implement the DocumentParser trait:

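// Example parser type; XmlParser is only an illustrative name.
pub struct XmlParser;

#[async_trait::async_trait]
impl DocumentParser for XmlParser {
    async fn parse(&self, file_path: &str) -> Result<Vec<DocumentChunk>> {
        // Your parsing logic here: read file_path and build the chunks.
        todo!("parse {file_path} into DocumentChunk values")
    }

    // Extensions are listed without the leading dot.
    fn supported_extensions(&self) -> Vec<&'static str> {
        vec!["xml"]
    }
}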

Register in src/parsers/mod.rs
Add to src/mcp/tools.rs in get_parser()

Troubleshooting

No logs visible in Claude Code CLI

The MCP server logs to stderr, which may not be visible in Claude Code CLI. To debug:

Set RUST_LOG=debug in your configuration

Run the server manually to see logs:

RUST_LOG=debug ./target/release/doc-indexer-mcp

Qdrant connection issues

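Ensure Qdrant is running. By default it serves gRPC on port 6334 (the port used in QDRANT_URL above) and its REST API on port 6333, so use the REST port when checking with curl:

./qdrant
# Check: curl http://localhost:6333/collections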

PDF parsing errors

Ensure pdftotext is installed:

which pdftotext
# If not found: brew install poppler

Testing MCP connection

Use Claude Code's /mcp command to verify the server is connected:

claude
> /mcp

This will list all available MCP servers and their tools.

License

MIT License - see LICENSE file.
