RAG In A Box
RAG pipeline for documents (Markdown, PDF, images). 10-step hybrid search, LLM enrichment, taxonomy, MCP server. Cloud or local.
Drop your documents into a folder, run the indexer, and get a production-grade RAG pipeline with an MCP server: any MCP-compatible AI assistant (Claude Code, OpenClaw, Claude Desktop, Cursor, etc.) can search your documents with a single config entry.
No infrastructure to manage. No GPU required. Works with cloud APIs out of the box or fully self-hosted.
Use cases
- Personal knowledge base: Index your notes, PDFs, documents, images, audio, and video. Ask your AI assistant questions and get answers grounded in your own files.
- Company document search: Drop legal contracts, reports, and SOPs into a folder. Employees search via any MCP-compatible assistant with metadata filters (by department, doc type, date, tags).
- Research assistant: Index papers, datasets, and notes. Search by meaning, not just keywords. LLM enrichment auto-extracts entities, topics, and key facts.
- Obsidian / Markdown vault: Works with any Markdown source (Obsidian, HackMD, Notion exports, GitBook). Extracts YAML frontmatter for rich filtering.
- PDF-heavy workflows: Scanned PDFs get OCR automatically. Page-aware chunking keeps context intact. Metadata (author, dates, page count) is extracted from PDF properties.
- Multi-agent tool: Expose your document collection as 16 MCP tools. Multiple agents can search, browse, filter, and manage taxonomy concurrently.
Why this over other RAG tools?
| Capability | RAG In A Box | Typical RAG |
|---|---|---|
| Search quality | 10-step hybrid pipeline (vector + BM25 + reranker + MMR) | Vector-only or basic hybrid |
| Document understanding | LLM enrichment extracts summary, entities, topics, importance | Raw chunks, no enrichment |
| Filtering | Pre-filter by tags, folder, doc type, topics, custom fields | Post-filter or none |
| Chunking | Heading-aware (MD) + page-aware (PDF) + semantic boundary detection | Fixed-size windows |
| Chunk context | Each chunk gets title, path, topics prepended for self-describing retrieval | Chunks lose document context |
| Metadata | YAML frontmatter auto-extracted, custom fields auto-promoted to filters | Manual schema setup |
| Taxonomy | Controlled vocabulary with semantic matching, managed via MCP tools | None |
| OCR | Built-in for scanned PDFs and images (cloud or local) | Separate pipeline needed |
| Deployment | Single container, cloud APIs, no GPU | Often needs GPU or complex infra |
| Integration | MCP server (16 tools); works with Claude, Cursor, any MCP client | Custom API or SDK |
| Resilience | Per-query diagnostics, auto-recovery from DB corruption, structured errors | Silent failures |
Stack
| Component | Provider |
|---|---|
| Embeddings | Qwen3-Embedding-8B via OpenRouter |
| LLM enrichment | GPT-4.1 Mini via OpenRouter |
| OCR | Gemini Vision (cloud) or DeepSeek OCR2 (local) |
| Reranker | Qwen3-Reranker-8B via DeepInfra |
| Vector + FTS | LanceDB + tantivy (BM25) |
| Orchestration | Prefect 3.x |
Getting started
1. Install
git clone https://github.com/DevNexsler/RAG-In-A-Box.git
cd RAG-In-A-Box
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
2. Configure
Copy the example config and edit two paths:
cp config.yaml.example config.yaml # cloud providers (default)
Open config.yaml and set:
- documents_root: path to your document collection
- index_root: where the index will be stored
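For instance, the two settings might look like this (paths are illustrative; the remaining keys come from the example template):

```yaml
documents_root: /home/me/Documents/vault   # folder the indexer scans
index_root: /home/me/.rag-index            # where the LanceDB index is written
```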
Self-hosting? Use cp config.local.yaml.example config.yaml instead. This config uses Ollama, DeepSeek OCR2, and llama-server; see Local mode below.
3. Add API keys
Create a .env file in the project root:
GEMINI_API_KEY=... # OCR - get one at https://aistudio.google.com/apikey
OPENROUTER_API_KEY=sk-or-... # embeddings + enrichment - https://openrouter.ai/keys
DEEPINFRA_API_KEY=... # reranker - https://deepinfra.com/dash/api_keys
4. Build the index
python run_index.py
This scans your documents, extracts text (Markdown, PDFs, images, audio, video), generates embeddings, and writes everything to a LanceDB index. Prefect auto-starts a temporary server for flow/task logging; the dashboard is at http://127.0.0.1:4200.
5. Connect your AI assistant
The MCP server gives any compatible AI assistant access to your documents via tools like file_search, file_status, and file_recent. The assistant launches the server automatically; you just add a config entry.
Claude Code
Add to your project's .mcp.json (or ~/.claude.json for global access):
{
  "mcpServers": {
    "doc-organizer": {
      "command": "/path/to/RAG-In-A-Box/.venv/bin/python",
      "args": ["/path/to/RAG-In-A-Box/mcp_server.py"],
      "cwd": "/path/to/RAG-In-A-Box"
    }
  }
}
OpenClaw
Add to your OpenClaw MCP config:
{
  "mcpServers": {
    "doc-organizer": {
      "command": "/path/to/RAG-In-A-Box/.venv/bin/python",
      "args": ["mcp_server.py"],
      "cwd": "/path/to/RAG-In-A-Box"
    }
  }
}
Claude Desktop
Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):
{
  "mcpServers": {
    "doc-organizer": {
      "command": "/path/to/RAG-In-A-Box/.venv/bin/python",
      "args": ["/path/to/RAG-In-A-Box/mcp_server.py"],
      "cwd": "/path/to/RAG-In-A-Box"
    }
  }
}
Any MCP-compatible client
The pattern is the same for Cursor, Windsurf, or any tool that supports MCP stdio servers. Point command at the venv Python and args at mcp_server.py. API keys are loaded from the .env file automatically; no need to pass them in the MCP config.
HTTP mode (remote / non-MCP clients)
python mcp_server.py --http
# Listens on 0.0.0.0:7788
VPS / Docker deployment
Run as a standalone HTTP server on any VPS or container platform. All ML inference uses cloud APIs, so no GPU is needed.
# Docker
docker build -t doc-organizer .
docker run -v /path/to/data:/data -p 7788:7788 \
-e OPENROUTER_API_KEY=... \
-e DEEPINFRA_API_KEY=... \
-e API_KEY=your-secret-token \
doc-organizer
# Or run directly
API_KEY=your-secret-token python server.py
Environment variable overrides (for container/VPS use):
| Variable | Description |
|---|---|
| DOCUMENTS_ROOT | Override documents path (default: from config.yaml) |
| INDEX_ROOT | Override index path (default: from config.yaml) |
| PORT | Server port (default: 7788) |
| API_KEY | Bearer token for HTTP auth. No auth when unset. |
When API_KEY is set, all HTTP requests must include Authorization: Bearer <API_KEY>. See config.vps.yaml.example for VPS-specific config.
Render.com: One-click deploy with render.yaml; persistent disk at /data, auto-generated API key.
REST API (file management)
When running in HTTP mode (--http or server.py), a REST API is available alongside the MCP server for uploading, downloading, and listing documents. Auth uses the same API_KEY bearer token.
Upload a file:
curl -X POST http://localhost:7788/api/upload \
-H "Authorization: Bearer $API_KEY" \
-F "file=@report.pdf" \
-F "directory=2-Area/Legal"
# -> {"uploaded": true, "doc_id": "2-Area/Legal/report.pdf", "size": 84521}
Download a file:
curl http://localhost:7788/api/documents/2-Area/Legal/report.pdf \
-H "Authorization: Bearer $API_KEY" -o report.pdf
List files in a directory:
curl "http://localhost:7788/api/documents/?directory=2-Area&limit=50" \
-H "Authorization: Bearer $API_KEY"
# -> {"directory": "2-Area", "files": [...], "total": 12, "offset": 0, "limit": 50}
| Endpoint | Method | Description |
|---|---|---|
| /api/upload | POST | Upload a file (multipart form: file + optional directory) |
| /api/documents/{doc_id} | GET | Download a file by path |
| /api/documents/ | GET | List files (query params: directory, limit, offset) |
Constraints: Max upload 100 MB. Allowed types: .md, .pdf, .png, .jpg, .jpeg. Path traversal is blocked. After uploading, run file_index_update (via MCP) to index the new document.
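The traversal guard mentioned above boils down to resolving the requested path and verifying it stays under the documents root. The sketch below illustrates the idea with pathlib; the function name and error handling are assumptions, not the project's actual code.

```python
from pathlib import Path

def resolve_safe(documents_root: str, doc_id: str) -> Path:
    """Resolve doc_id under documents_root, rejecting path traversal."""
    root = Path(documents_root).resolve()
    candidate = (root / doc_id).resolve()
    # A doc_id like "../../etc/passwd" resolves outside the root and is rejected.
    if not candidate.is_relative_to(root):
        raise ValueError(f"path traversal blocked: {doc_id}")
    return candidate
```

Path.is_relative_to requires Python 3.9+; on older versions, compare against candidate's parents instead.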
Local mode (optional)
To run everything on your own hardware instead of cloud APIs:
1. Copy config.local.yaml.example to config.yaml
2. Install and start Ollama:
   brew install ollama && ollama serve
   ollama pull qwen3-embedding:0.6b     # semantic chunking
   ollama pull qwen3-embedding:4b-q8_0  # embeddings
3. Start DeepSeek OCR2 on port 8790 (for PDF/image OCR)
No cloud API keys are needed in local mode (except for the reranker: DeepInfra is cloud-only).
Features
 Document Collection                       AI Assistants
+------------------+                    +-------------------+
| Markdown (.md)   |                    | Claude Code       |
| PDFs             |   +---------+      | OpenClaw          |
| Images (.png/jpg)|-->| Indexer |      | Claude Desktop    |
+------------------+   +----+----+      | Cursor / Windsurf |
                            |           +--------+----------+
                            v                    |
                 +----------+----------+         |  MCP (stdio)
                 |   LanceDB Index     |         |
                 | vectors + metadata  |<--------+
                 | + full-text (BM25)  |    file_search
                 +---------------------+    file_status
                                            file_recent ...
Hybrid search: Every query runs vector (semantic) and keyword (BM25) search in parallel, fuses results with Reciprocal Rank Fusion, applies length normalization, importance weighting, optional recency boost with a time-decay floor, cross-encoder reranking (60/40 blend with cosine fallback), MMR diversity filtering, and minimum-score thresholding. Pre-filters (tags, folders, doc type, topics, and complex JSON filters) apply at the database level before retrieval so every result matches.
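The fusion step can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion with the conventional k = 60 constant, not the project's actual implementation; the function name and inputs are hypothetical.

```python
def rrf_fuse(vector_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked result lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by either retriever float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in (vector_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["a", "b", "c"], ["b", "d"])  # fused[0] == "b"
```

Because scores are summed across lists, a document ranked moderately by both retrievers ("b" above) beats one ranked first by only one ("a").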
Multi-format extraction: Indexes Markdown, PDFs, images, audio, and video. PDFs use text extraction first, falling back to OCR for scanned pages. Images get OCR text plus visual descriptions. Audio/video files are sent base64-encoded to OpenRouter-compatible media models for transcripts and search notes. EXIF metadata (camera, GPS, dates) is extracted automatically.
LLM enrichment: Each document is analyzed by an LLM to extract structured metadata: summary, document type, entities (people, places, orgs, dates), topics, keywords, key facts, suggested tags, and suggested folder. All fields are searchable and filterable.
Taxonomy system: A controlled vocabulary of tags and folder paths stored in a separate LanceDB table with embedded descriptions. The LLM uses the taxonomy during enrichment to suggest consistent tags and filing locations. Seeded from existing tag/directory databases. Managed via 7 MCP CRUD tools (file_taxonomy_*).
Smart chunking: Markdown is split by headings, PDFs by pages. Large sections get semantic chunking (topic-boundary detection via sentence embeddings). Every chunk gets a contextual header prepended with its title, path, and topics, so each chunk is self-describing for better retrieval.
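The contextual-header idea can be illustrated with a small sketch; the field names and header layout here are assumptions, not the project's schema.

```python
def with_context(chunk_text: str, title: str, path: str, topics: list[str]) -> str:
    """Prepend a contextual header so the chunk is self-describing.

    A bare chunk like "The penalty is 2% per month" gives the retriever no
    clue which document or topic it belongs to; the header restores that.
    """
    header = f"Document: {title}\nPath: {path}\nTopics: {', '.join(topics)}\n---\n"
    return header + chunk_text

chunk = with_context(
    "The penalty is 2% per month.",
    title="Vendor MSA",
    path="2-Area/Legal/msa.md",
    topics=["contracts", "payment terms"],
)
```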
Rich metadata & filtering: YAML frontmatter (tags, status, author, dates, custom fields) is automatically extracted and promoted to filterable columns. Custom frontmatter keys are auto-promoted with no schema changes needed. file_search supports exact filters plus complex JSON filters with eq, ne, contains, prefix, in, and, or, and not.
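As an illustration only, a complex filter combining those operators might be shaped like the JSON below; the nesting and key names are assumptions based on the operator list, not the documented file_search schema.

```json
{
  "and": [
    {"eq": {"doc_type": "contract"}},
    {"in": {"department": ["legal", "finance"]}},
    {"not": {"contains": {"tags": "archived"}}}
  ]
}
```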
MCP server: Exposes 16 tools over the Model Context Protocol. Any MCP-compatible assistant can search, browse, and filter your documents, and manage taxonomy entries. Works over stdio (launched automatically by the assistant) or HTTP.
Incremental updates: Only new and modified files are processed on re-index. Deleted files are cleaned up automatically. Failed documents are tracked and retried.
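The incremental logic amounts to diffing stored modification times against the filesystem. A rough sketch of the idea follows; the real pipeline tracks more state than this, and the function and argument names are illustrative.

```python
from pathlib import Path

def diff_index(indexed: dict[str, float], root: str) -> tuple[list[str], list[str]]:
    """Compare indexed mtimes against the filesystem under root.

    Returns (to_process, to_delete): files that are new or modified since
    indexing, and indexed entries whose source file no longer exists.
    """
    on_disk = {str(p.relative_to(root)): p.stat().st_mtime
               for p in Path(root).rglob("*") if p.is_file()}
    to_process = [f for f, mtime in on_disk.items()
                  if f not in indexed or mtime > indexed[f]]
    to_delete = [f for f in indexed if f not in on_disk]
    return to_process, to_delete
```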
Cloud or local: Every component (OCR, embeddings, enrichment, reranker) has both cloud and local provider options. The default config uses cloud APIs with no servers to run. Switch to fully self-hosted with a single config file swap.
Resilient by default: Per-document error handling with retries, structured MCP error responses, search diagnostics on every query (vector_search_active, reranker_applied, degraded), SQL injection protection on filter keys, and automatic LanceDB corruption recovery (version rollback + rebuild).
MCP tools
| Tool | Description |
|---|---|
| file_search | Hybrid semantic + keyword search with exact filters plus complex JSON filters (and/or/not, in, contains, etc.) |
| file_get_chunk | Get full text + metadata for one chunk by doc_id and loc |
| file_get_doc_chunks | Get all chunks for a document, sorted by position |
| file_list_documents | Browse all indexed documents with pagination and filters |
| file_recent | Recently modified/indexed docs (newest first) |
| file_facets | Distinct values + counts for all filterable fields |
| file_folders | Document folder/directory structure with file counts |
| file_status | Index stats, provider settings, health checks |
| file_index_update | Incrementally update the index without leaving the assistant |
| file_taxonomy_list | List taxonomy entries (tags, folders, doc_types) with filters |
| file_taxonomy_get | Get a single taxonomy entry by id |
| file_taxonomy_search | Semantic search on taxonomy descriptions |
| file_taxonomy_add | Add a new taxonomy entry |
| file_taxonomy_update | Update an existing taxonomy entry |
| file_taxonomy_delete | Delete a taxonomy entry |
| file_taxonomy_import | Import taxonomy from SQLite seed databases |
Run tests
python -m pytest tests/ -m "not live" -x # ~370 offline tests (no API keys)
python -m pytest tests/ -x # ~454 full suite (requires API keys)
Project layout
core/ Config, storage interface, taxonomy helpers
providers/embed/ Embedding providers (OpenRouter, Ollama, LlamaIndex)
providers/llm/ LLM providers (OpenRouter, Ollama)
providers/ocr/ OCR providers (Gemini Vision, DeepSeek OCR2)
taxonomy_store.py Taxonomy LanceDB store (CRUD, vector search, FTS)
doc_enrichment.py LLM metadata extraction (with taxonomy integration)
extractors.py Text extraction (MD, PDF, images, audio/video)
flow_index_vault.py Prefect indexing flow
lancedb_store.py LanceDB storage + search
search_hybrid.py 10-step hybrid search pipeline
mcp_server.py MCP server (stdio + HTTP, 16 tools)
server.py VPS entrypoint; starts HTTP server on $PORT
run_index.py CLI entrypoint
scripts/seed_taxonomy.py Import taxonomy from existing SQLite DBs
config.yaml.example Cloud config template
config.local.yaml.example Local/self-hosted config template
config.vps.yaml.example VPS/container config template
Dockerfile Docker image (Python 3.13-slim, no GPU)
.dockerignore Docker build exclusions
render.yaml Render.com deployment descriptor
tests/ ~454 tests
docs/architecture.md Search pipeline, schema, component details
docs/vps-architecture.md VPS/cloud deployment architecture
License
PolyForm Noncommercial 1.0.0: free for personal, research, educational, and nonprofit use. Commercial use requires a separate license.
