🦖 VelociRAG
Lightning-fast RAG for AI agents.
Four-layer retrieval fusion powered by ONNX Runtime. No PyTorch. Sub-200ms warm search. Incremental graph updates. MCP-ready.
Most RAG solutions either drag in 2GB+ of PyTorch or limit you to single-layer vector search. VelociRAG gives you four retrieval methods (vector similarity, BM25 keyword matching, knowledge graph traversal, and metadata filtering), fused through reciprocal rank fusion with cross-encoder reranking. All of it runs on ONNX Runtime: no GPU, no API keys. It ships with an MCP server for agent integration, a Unix socket daemon for warm queries, and a CLI that just works.
🚀 Quick Start
MCP Server (Claude, Cursor, Windsurf)
pip install "velocirag[mcp]"
velocirag index ./my-docs
velocirag mcp
Claude Code: add to .mcp.json in your project root:
{
"mcpServers": {
"velocirag": {
"command": "velocirag",
"args": ["mcp"],
"env": { "VELOCIRAG_DB": "/path/to/data" }
}
}
}
Then open /mcp in Claude Code and enable the velocirag server. If using a virtualenv, use the full path to the binary (e.g. .venv/bin/velocirag).
Claude Desktop: add to claude_desktop_config.json:
{
"mcpServers": {
"velocirag": {
"command": "velocirag",
"args": ["mcp", "--db", "/path/to/data"]
}
}
}
Cursor: add to .cursor/mcp.json:
{
"mcpServers": {
"velocirag": {
"command": "velocirag",
"args": ["mcp", "--db", "/path/to/data"]
}
}
}
Python API
from velocirag import Embedder, VectorStore, Searcher
embedder = Embedder()
store = VectorStore('./my-db', embedder)
store.add_directory('./my-docs')
searcher = Searcher(store, embedder)
results = searcher.search('query', limit=5)
CLI
pip install velocirag
velocirag index ./my-docs
velocirag search "your query here"
Search Daemon (warm engine for CLI users)
velocirag serve --db ./my-data # start daemon (background)
velocirag search "query" # auto-routes through daemon
velocirag status # check daemon health
velocirag stop # stop daemon
The daemon keeps the ONNX model + FAISS index warm over a Unix socket. The first query loads the engine (~1s); subsequent queries return in ~180ms with full 4-layer fusion.
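The warm-daemon pattern itself is easy to picture: a long-lived process holds expensive state (here, the ONNX model and FAISS index) and answers short requests over a Unix socket, so clients skip per-query startup cost. The sketch below is generic and uses a hypothetical newline-delimited JSON protocol; VelociRAG's actual wire format may differ:

```python
# Generic warm-daemon sketch: a Unix-socket server holds state in memory
# so clients avoid reloading it on every query.
# Hypothetical protocol (newline-delimited JSON), NOT VelociRAG's real one.
import json
import os
import socket
import tempfile
import threading

SOCK = os.path.join(tempfile.mkdtemp(), "demo.sock")

def serve(ready):
    warm_state = {"model": "loaded-once"}  # stands in for ONNX model + FAISS index
    srv = socket.socket(socket.AF_UNIX)
    srv.bind(SOCK)
    srv.listen(1)
    ready.set()                            # signal that the socket is ready
    conn, _ = srv.accept()
    req = json.loads(conn.makefile().readline())
    reply = {"echo": req["query"], **warm_state}
    conn.sendall((json.dumps(reply) + "\n").encode())
    conn.close()
    srv.close()

ready = threading.Event()
threading.Thread(target=serve, args=(ready,), daemon=True).start()
ready.wait()

# Client side: connect, send one request line, read one reply line.
cli = socket.socket(socket.AF_UNIX)
cli.connect(SOCK)
cli.sendall(b'{"query": "hello"}\n')
reply = json.loads(cli.makefile().readline())
cli.close()
print(reply["echo"])  # hello
```

The real daemon does the same thing at a higher stakes: loading the model once is the ~1s cost, and every later query pays only the socket round-trip plus inference.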
🎯 Why VelociRAG?
- 4-layer search: vector + BM25 keyword + knowledge graph + metadata, fused with RRF
- No LLM needed: search runs entirely on local models (MiniLM + TinyBERT, ~80MB total)
- No GPU needed: pure ONNX inference, runs on any machine
- ~3ms warm search: daemon keeps models + indices warm over a Unix socket
- Incremental indexing: add files without rebuilding the whole index
- MCP server: plug into Claude, Cursor, Windsurf, any MCP client
Related Projects
- Memkoshi: Agent memory system. Uses VelociRAG as its search engine.
- Stelline: Session intelligence. Crafts memories from conversation logs.
- Glyph: MCP security scanner and runtime protection.
🏗️ How It Works
The 4-layer pipeline:
Query → expand (acronyms, variants)
  ├─ [Vector] FAISS cosine similarity (384d, MiniLM-L6-v2 via ONNX)
  ├─ [Keyword] BM25 via SQLite FTS5
  ├─ [Graph] Knowledge graph traversal
  └─ [Metadata] Structured SQL filters (tags, status, project)
  → RRF Fusion → Cross-encoder rerank → Results
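The fusion step is plain reciprocal rank fusion: each layer contributes 1/(k + rank) per document, and the sums decide the merged order. A minimal generic sketch (not VelociRAG's internal code; k=60 is the constant from the original RRF paper):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Merge several ranked result lists via reciprocal rank fusion.

    rankings: list of ranked lists of document ids (best first).
    Returns ids sorted by summed 1/(k + rank) score, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: three layers each rank a handful of (hypothetical) documents.
vector = ["doc_a", "doc_b", "doc_c"]
keyword = ["doc_c", "doc_a"]
graph = ["doc_a", "doc_d"]
print(rrf_fuse([vector, keyword, graph]))  # "doc_a" wins: top-ranked in two layers
```

Because RRF only uses ranks, not raw scores, it merges layers with incomparable scoring scales (cosine similarity, BM25, graph distance) without any normalization step.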
What each layer catches:
| Query type | Vector | Keyword | Graph | Metadata |
|---|---|---|---|---|
| Conceptual ("improve error handling") | ✅ | ❌ | ❌ | ❌ |
| Exact match ("ERR_CONNECTION_REFUSED") | ❌ | ✅ | ❌ | ❌ |
| Connected concepts | ❌ | ❌ | ✅ | ❌ |
| Filtered ("#python status:active") | ❌ | ❌ | ❌ | ✅ |
| Combined ("React state management") | ✅ | ✅ | ✅ | ❌ |
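The keyword column above is BM25 as implemented by SQLite's FTS5 extension, which ships with Python's standard sqlite3 module on most builds. This standalone sketch (not VelociRAG code) shows why exact tokens like ERR_CONNECTION_REFUSED are caught by this layer even when vector similarity would miss them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: full-text index with built-in bm25() ranking.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("errors", "Fix ERR_CONNECTION_REFUSED by checking the daemon socket"),
        ("intro", "A gentle introduction to retrieval-augmented generation"),
    ],
)
# bm25(docs) returns a rank where lower is better, so ORDER BY ascending.
rows = conn.execute(
    "SELECT title, bm25(docs) FROM docs "
    "WHERE docs MATCH ? ORDER BY bm25(docs)",
    ('"ERR_CONNECTION_REFUSED"',),  # quoted so FTS5 treats it as a phrase
).fetchall()
print(rows[0][0])  # the exact-match document wins
```

The default unicode61 tokenizer splits on underscores, so the query and the document both tokenize to the same phrase and match exactly; this is the behavior the doc's query-expansion layer builds on with its underscore-aware tokenization.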
✨ Features
- ONNX Runtime: 184ms cold start, 3ms cached. No PyTorch, no GPU
- Four-layer fusion: FAISS vector similarity + SQLite FTS5 (BM25) + knowledge graph + metadata filtering, merged via reciprocal rank fusion
- Cross-encoder reranking: TinyBERT reranker via ONNX Runtime, included in the base install, no PyTorch needed. Downloads a ~17MB model on first use
- Incremental graph updates: file-centric provenance tracking detects what changed and only rebuilds affected nodes/edges. Cascading deletes maintain consistency across all stores (vector, graph, metadata). Multi-source support with isolated provenance per source
- MCP server: five tools (search, index, add_document, health, list_sources) for Claude, Cursor, Windsurf
- Search daemon: Unix socket server keeps the ONNX model + FAISS index warm between queries
- Knowledge graph: analyzers build entity, temporal, topic, and explicit-link edges from markdown. Optional GLiNER NER. 418 files in 2.1s
- Smart chunking: header-aware splitting preserves document structure and parent context
- Query expansion: acronym registry, casing/spacing variants, underscore-aware tokenization
- Runs anywhere: CPU-only, 8GB RAM, no API keys, no external services
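Header-aware chunking, as listed above, roughly means splitting on markdown headings while carrying the heading path along as parent context, so a chunk like an install snippet still "knows" which guide and section it came from. A minimal illustrative sketch, not VelociRAG's actual chunker:

```python
import re

def chunk_markdown(text):
    """Split markdown on headings; pair each chunk with its heading path."""
    chunks, path, buf = [], [], []

    def flush():
        # Emit the buffered body under the current heading path.
        if buf:
            chunks.append((" > ".join(path), "\n".join(buf).strip()))
            buf.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            # Truncate the path to the parent level, then append this heading.
            path[:] = path[: level - 1] + [m.group(2).strip()]
        else:
            buf.append(line)
    flush()
    return chunks

doc = "# Guide\nIntro text.\n## Install\npip install velocirag\n"
for context, body in chunk_markdown(doc):
    print(context, "::", body)
```

Keeping the "Guide > Install" path attached to each chunk is what lets the retriever embed and display fragments without losing the document structure around them.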
🤖 MCP Server
VelociRAG exposes a Model Context Protocol server for seamless agent integration:
Available tools:
- search: 4-layer fusion search with reranking
- index: Add documents to the knowledge base
- add_document: Insert a single document
- health: System diagnostics
- list_sources: Show indexed document sources
The MCP server process stays alive between queries, so models load once and every subsequent search is warm. Works with any MCP-compatible client.
🐍 Python API
Full 4-layer unified search:
from velocirag import (
Embedder, VectorStore, Searcher,
GraphStore, MetadataStore, UnifiedSearch,
GraphPipeline
)
# Build the full stack
embedder = Embedder()
store = VectorStore('./search-db', embedder)
graph_store = GraphStore('./search-db/graph.db')
metadata_store = MetadataStore('./search-db/metadata.db')
# Index with graph + metadata
store.add_directory('./docs')
pipeline = GraphPipeline(graph_store, embedder, metadata_store)
pipeline.build('./docs', source_name='my-docs')
# Unified search across all layers
searcher = Searcher(store, embedder)
unified = UnifiedSearch(searcher, graph_store, metadata_store)
results = unified.search(
'machine learning algorithms',
limit=5,
enrich_graph=True,
filters={'tags': ['python'], 'status': 'active'}
)
Quick semantic search:
from velocirag import Embedder, VectorStore, Searcher
embedder = Embedder()
store = VectorStore('./db', embedder)
store.add_directory('./docs')
searcher = Searcher(store, embedder)
results = searcher.search('neural networks', limit=10)
Incremental graph updates:
from velocirag import Embedder, GraphStore, GraphPipeline
# First run: full build, populates provenance
gs = GraphStore('./db/graph.db')
pipeline = GraphPipeline(gs, embedder=Embedder())
pipeline.build('./docs', source_name='my-docs') # full build
# Subsequent runs: only changed files get reprocessed
pipeline.build('./docs', source_name='my-docs') # incremental (automatic)
# Force full rebuild
pipeline.build('./docs', source_name='my-docs', force_rebuild=True)
# Multi-source graphs
pipeline.build('./project-a', source_name='project-a')
pipeline.build('./project-b', source_name='project-b') # isolated provenance
# Deleted files automatically cascade across all stores
# (vector, FTS5, graph, metadata) on next build
💻 CLI Reference
# Index documents (graph + metadata built by default)
velocirag index <path> [--no-graph] [--no-metadata] [--gliner] [--full-graph] [--force]
[--source NAME] [--db PATH]
# Search across all layers (auto-routes through daemon if running)
velocirag search <query> [--limit N] [--threshold F] [--format text|json]
# Search daemon
velocirag serve [--db PATH] [-f] # start daemon (-f for foreground)
velocirag stop # stop daemon
velocirag status # check daemon health
# Metadata queries
velocirag query [--tags TAG] [--status S] [--project P] [--recent N]
# System health and status
velocirag health [--format text|json]
# Start MCP server
velocirag mcp [--db PATH] [--transport stdio|sse]
Options:
- --no-graph: Skip knowledge graph build
- --no-metadata: Skip metadata extraction
- --full-graph: Build graph WITH semantic similarity edges (~2GB extra RAM)
- --source NAME: Label for multi-source provenance isolation
- --force: Clear and rebuild from scratch
- --gliner: Use GLiNER for entity extraction (requires pip install "velocirag[ner]")
📊 Performance
Real benchmarks on ByteByteGo/system-design-101 (418 files, 1,001 chunks):
| Metric | Value |
|---|---|
| Index (418 files) | 13.6s |
| Search (warm, 5 results) | 35–90ms |
| Graph build (light) | 2.1s (2,397 nodes, 8,717 edges) |
| Incremental update (1 file) | 1.3s |
| Reranker | Cross-encoder TinyBERT via ONNX |
| Install size | ~80MB (no PyTorch) |
| RAM usage | <1GB with all models loaded |
Production deployment (6,300+ chunks, 3 sources, 950 files):
| Metric | Value |
|---|---|
| Full search (warm) | 16ms avg, 2ms min |
| Full search (first run) | 22ms avg, 4ms min |
| Search P50 / P95 | 17ms / 55ms |
| Hit rate (100-query benchmark) | 99/100 |
| Graph | 3,125 nodes, 132,320 edges |
| Reranker | Cross-encoder TinyBERT via ONNX |
| RAM | <1GB with all models loaded |
⚙️ Configuration
| Environment Variable | Default | Description |
|---|---|---|
| VELOCIRAG_DB | ./.velocirag | Database directory |
| VELOCIRAG_SOCKET | /tmp/velocirag-daemon.sock | Daemon socket path |
| NO_COLOR | (unset) | Disable colored output |
Dependencies (all included in base install):
- onnxruntime: ONNX inference (embedder + reranker)
- tokenizers + huggingface-hub: model loading
- faiss-cpu: vector similarity search
- networkx + scikit-learn: knowledge graph + topic clustering
- numpy, click, pyyaml, python-frontmatter
Optional extras:
- pip install "velocirag[mcp]": MCP server (adds fastmcp)
- pip install "velocirag[ner]": GLiNER entity extraction (adds gliner, requires PyTorch)
📚 References
VelociRAG builds on these foundational works:
Core Fusion & Retrieval
Reciprocal Rank Fusion – Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods." SIGIR '09.
Core fusion algorithm for merging results across retrieval layers.
BM25 – Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M., & Gatford, M. (1994). "Okapi at TREC-3." TREC-3.
Keyword search foundation via SQLite FTS5.
Embeddings & Neural IR
Sentence-BERT – Reimers, N., & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP 2019.
Dense embedding architecture using all-MiniLM-L6-v2.
MiniLM – Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., & Zhou, M. (2020). "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers." NeurIPS 2020.
Efficient transformer distillation for production embedding models.
Reranking & Neural Models
Cross-Encoder Reranking – Nogueira, R., & Cho, K. (2019). "Passage Re-ranking with BERT." arXiv:1901.04085.
Cross-attention reranking with TinyBERT on MS MARCO.
TinyBERT – Jiao, X., et al. (2020). "TinyBERT: Distilling BERT for Natural Language Understanding." Findings of EMNLP 2020.
Compressed BERT for fast reranking inference.
Vector Search & Systems
FAISS – Johnson, J., Douze, M., & Jégou, H. (2019). "Billion-scale similarity search with GPUs." IEEE Transactions on Big Data.
High-performance vector similarity search engine.
GLiNER – Zaratiana, U., Nzeyimana, A., & Holat, P. (2023). "GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer." arXiv:2311.08526.
Generalist NER for knowledge graph entity extraction (optional dependency).
📄 License
MIT. Use it anywhere, build anything.
Need agent integration help? Check AGENTS.md for machine-readable project context.
Built for agents who think fast and remember faster.
