BM25 Turbo Rust Python WASM CLI
The fastest BM25 scoring engine: 2,300x faster than BM25S. 28K QPS on 8.8M docs. 5 BM25 variants (Robertson, Lucene, ATIRE, BM25L, BM25+). Memory-mapped persistence, BMW pruning, streaming indexing. Built-in HTTP server, MCP tool, HuggingFace Hub integration. Drop-in bm25s replacement for RAG & ML pipelines.
BM25 Turbo ⚡
Rust · Python · WASM · CLI
The fastest BM25 scoring engine. Period.
28,217 queries/second on 8.8 million documents. 8.6 ms P50 latency. Precomputed sparse BM25 with BMW pruning, memory-mapped persistence, and zero-copy index loading. Built for RAG pipelines, search applications, and ML workflows.
use bm25_turbo::{BM25Builder, Method};
let index = BM25Builder::new()
.method(Method::Lucene) // Robertson, Lucene, ATIRE, BM25L, BM25+
.k1(1.5).b(0.75)
.build_from_corpus(&[
"Rust is a systems programming language",
"BM25 is a ranking function used in information retrieval",
"Machine learning models benefit from fast retrieval",
])?;
let results = index.search("information retrieval", 10)?;
for (id, score) in results.doc_ids.iter().zip(results.scores.iter()) {
    println!("doc {} → {:.4}", id, score);
}
Why BM25 Turbo?
Most BM25 libraries compute scores at query time, scanning inverted indexes on every request. BM25 Turbo takes the opposite approach: precompute every BM25 score at index time into a compressed sparse column (CSC) matrix. Queries become sparse vector lookups with no math at serving time.
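The core idea fits in a few lines. This is a minimal sketch under assumed types (CscIndex and its fields are illustrative, not the library's internals): query terms select columns of a sparse matrix, and precomputed scores are summed per document.

use std::collections::HashMap;

// Hypothetical, simplified CSC layout: one column of precomputed
// BM25 scores per vocabulary term, storing only non-zero entries.
struct CscIndex {
    vocab: HashMap<String, usize>, // term -> column index
    col_offsets: Vec<usize>,       // where each column starts in row_ids/scores
    row_ids: Vec<u32>,             // document id for each stored entry
    scores: Vec<f32>,              // precomputed BM25 score for each entry
}

impl CscIndex {
    fn search(&self, query_terms: &[&str], k: usize) -> Vec<(u32, f32)> {
        let mut acc: HashMap<u32, f32> = HashMap::new();
        for term in query_terms {
            if let Some(&col) = self.vocab.get(*term) {
                // Sparse column lookup: no BM25 math at query time, just
                // summing scores that were already computed at index time.
                for i in self.col_offsets[col]..self.col_offsets[col + 1] {
                    *acc.entry(self.row_ids[i]).or_insert(0.0) += self.scores[i];
                }
            }
        }
        let mut ranked: Vec<(u32, f32)> = acc.into_iter().collect();
        ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        ranked.truncate(k);
        ranked
    }
}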
This makes BM25 Turbo the right choice when:
- You query the same index many times (RAG, reranking, batch evaluation)
- You need deterministic, reproducible scores (ML pipelines, experiments)
- You want the simplest possible API (3 lines to index, 1 line to query)
- You're building a search feature and don't need a full search server
Performance
Benchmarked on MS MARCO (8,841,823 documents, 509,962 queries), the standard information retrieval benchmark. BMW pruning enabled. Single-threaded.
Scaling Across Corpus Sizes
| Corpus | Documents | P50 Latency | QPS | nDCG@10 |
|---|---|---|---|---|
| SciFact | 5,183 | 67 μs | 682,127 | 0.665 |
| FiQA | 57,638 | 711 μs | 50,812 | 0.254 |
| MS MARCO | 8,841,823 | 8.6 ms | 28,217 | – |
Sub-millisecond on corpora under 100K docs. On multi-million document corpora, BMW pruning keeps latency competitive with inverted-index engines while maintaining the precomputed scoring advantage for batch workloads.
The Tradeoff
BM25 Turbo front-loads computation: index once, query millions of times. Indexing is slower because every BM25 score is precomputed and compressed into the CSC matrix. But every subsequent query is a sparse vector lookup, with no math at serving time.
Features
5 BM25 Scoring Variants
| Variant | Formula | Best For |
|---|---|---|
| Robertson | Classic BM25 (Okapi) | Academic benchmarks |
| Lucene | Apache Lucene's variant | Production search (default) |
| ATIRE | IDF without +1 smoothing | Research comparisons |
| BM25L | Long document correction | Corpora with varying doc lengths |
| BM25+ | Lower-bound term frequency | Penalizing non-matching terms |
All variants support tunable k1, b, and delta parameters.
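For example, selecting BM25L with an explicit delta might look like the following. Note this is a sketch: the delta() setter is assumed by analogy with the k1() and b() setters shown in the quickstart, and the variant name follows the Method enum comment there; check the crate docs for the exact API.

use bm25_turbo::{BM25Builder, Method};

// Assumed API: a delta() setter alongside k1() and b().
let index = BM25Builder::new()
    .method(Method::BM25L)
    .k1(1.5)
    .b(0.75)
    .delta(0.5) // lower-bound correction used only by BM25L/BM25+
    .build_from_corpus(&corpus)?;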
BMW (Block-Max WAND) Pruning
Skip non-competitive documents during top-k retrieval. Essential for million-document corpora:
let mut index = BM25Builder::new()
.build_from_corpus(&corpus)?;
// Build block-max index (one-time cost)
index.build_bmw_index()?;
// Queries now skip non-competitive blocks automatically
let results = index.search_approximate("distributed systems", 10)?;
BMW partitions the score matrix into blocks and maintains per-block upper bounds. During query evaluation, entire blocks are skipped when their maximum possible contribution can't beat the current k-th best score. This reduces the number of documents touched from millions to thousands.
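A simplified single-term sketch of that skip logic (illustrative only; real WAND-style evaluation sums per-term upper bounds across the whole query before deciding to skip):

use std::collections::HashMap;

// Illustrative block-max pruning over one term's scored entries,
// grouped into blocks with a precomputed per-block maximum score.
fn score_with_pruning(
    blocks: &[Vec<(u32, f32)>], // (doc_id, score) entries per block
    block_max: &[f32],          // upper bound on any score in each block
    threshold: f32,             // current k-th best accumulated score
    acc: &mut HashMap<u32, f32>,
) {
    for (b, entries) in blocks.iter().enumerate() {
        // Skip the entire block if even its best entry
        // cannot beat the current top-k threshold.
        if block_max[b] <= threshold {
            continue;
        }
        for &(doc_id, score) in entries {
            *acc.entry(doc_id).or_insert(0.0) += score;
        }
    }
}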
Memory-Mapped Persistence
Save indexes to disk and reload them instantly with zero-copy memory mapping:
use bm25_turbo::persistence;
use std::path::Path;
// Save (serializes CSC matrix + vocabulary + parameters)
persistence::save(&index, Path::new("my_index.bm25"))?;
// Standard load (deserialize into RAM)
let index = persistence::load(Path::new("my_index.bm25"))?;
// Memory-mapped load (instant, zero-copy, ideal for huge indexes)
let mmap_index = persistence::load_mmap(Path::new("my_index.bm25"))?;
Memory-mapped indexes load in microseconds regardless of size. The OS pages data in on demand: a 10GB index starts serving queries immediately without waiting for the full file to be read.
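Zero-copy loading is the standard mmap trick: map the file and read structures in place rather than deserializing into heap memory. A rough illustration using the memmap2 crate (not necessarily what bm25-turbo uses internally):

use memmap2::Mmap;
use std::fs::File;

fn open_index(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Mapping is instant; the OS faults pages in only when touched.
    let mmap = unsafe { Mmap::map(&file)? };
    Ok(mmap)
}

// Byte views borrow the page cache directly: no copy, no parse pass.
// let index_bytes = open_index("my_index.bm25")?;
// let header = &index_bytes[..16];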
Streaming Indexer
Index corpora larger than available memory by processing documents in configurable chunks:
use bm25_turbo::{StreamingBuilder, Method};
let mut builder = StreamingBuilder::new()
.chunk_size(100_000) // Process 100K docs at a time
.method(Method::Lucene);
builder.add_documents(&["doc one", "doc two", "doc three"]);
let index = builder.build()?;
Peak memory: O(chunk_size × avg_tokens) instead of O(total_corpus).
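For corpora that live on disk, the same builder can be fed chunk by chunk. Here is a sketch that streams one document per line from a hypothetical corpus.txt, using only the StreamingBuilder calls shown above:

use bm25_turbo::{Method, StreamingBuilder};
use std::fs::File;
use std::io::{BufRead, BufReader};

let mut builder = StreamingBuilder::new()
    .chunk_size(100_000)
    .method(Method::Lucene);

// Buffer one chunk of lines at a time, so peak memory tracks
// the chunk size rather than the whole corpus.
let reader = BufReader::new(File::open("corpus.txt")?);
let mut chunk: Vec<String> = Vec::with_capacity(100_000);
for line in reader.lines() {
    chunk.push(line?);
    if chunk.len() == 100_000 {
        let docs: Vec<&str> = chunk.iter().map(String::as_str).collect();
        builder.add_documents(&docs);
        chunk.clear();
    }
}
let docs: Vec<&str> = chunk.iter().map(String::as_str).collect();
builder.add_documents(&docs); // flush the final partial chunk
let index = builder.build()?;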
Write-Ahead Log (WAL)
Add and delete documents without rebuilding the entire index:
use bm25_turbo::wal::WriteAheadLog;
let mut wal = WriteAheadLog::new();
// Incremental updates
index.add_documents(&mut wal, &["new document about Rust"])?;
wal.delete_documents(&[42, 87])?;
// Compact when the WAL grows large
index.compact(&mut wal)?;
Built-in Tokenizer
17 language stemmers with configurable stopword removal:
use bm25_turbo::Tokenizer;
use rust_stemmers::Algorithm;
let tokenizer = Tokenizer::builder()
.stemmer(Algorithm::English)
.stopwords(vec!["the".into(), "a".into(), "an".into(), "is".into(), "at".into()])
.build()?;
let tokens = tokenizer.tokenize("Running distributed systems at scale");
// → ["run", "distribut", "system", "scale"]
Supported languages: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish.
Distributed Search (gRPC)
Shard large indexes across multiple nodes and query them as one:
use bm25_turbo::distributed::{QueryCoordinator, ShardEndpoint};
// Define shard endpoints (one per machine/core)
let shards = vec![
ShardEndpoint { endpoint: "http://[::1]:50051".into(), shard_id: 0, doc_id_offset: 0 },
ShardEndpoint { endpoint: "http://[::1]:50052".into(), shard_id: 1, doc_id_offset: 500_000 },
];
// Coordinator fans out queries and merges results
let coordinator = QueryCoordinator::connect(shards).await?;
let results = coordinator.query("distributed query", 10).await?;
Interfaces
CLI
# Index a corpus
bm25-turbo index --input corpus.jsonl --output index.bm25 --field text
# Search
bm25-turbo search --index index.bm25 --query "information retrieval" -k 10
# Start HTTP server
bm25-turbo serve --index index.bm25 --port 8080
# Push/pull from HuggingFace Hub
bm25-turbo push --index index.bm25 --repo username/my-index
bm25-turbo pull --repo username/my-index --output index.bm25
Supports CSV, JSONL, JSON array, and plain text (one document per line). Auto-detects format when possible.
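For example, with --field text, a minimal corpus.jsonl holds one JSON object per line (the id key here is illustrative; only the selected field is indexed):

{"id": 1, "text": "Rust is a systems programming language"}
{"id": 2, "text": "BM25 is a ranking function used in information retrieval"}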
HTTP Server
$ bm25-turbo serve --index index.bm25 --port 8080
# Search
curl -X POST http://localhost:8080/search \
-H "Content-Type: application/json" \
-d '{"query": "machine learning", "k": 10}'
# Health check
curl http://localhost:8080/health
# Index statistics
curl http://localhost:8080/stats
MCP Server (AI Agent Integration)
Expose BM25 search as a tool for AI agents via the Model Context Protocol:
bm25-turbo mcp --index index.bm25 --port 8080
The MCP server exposes two tools:
- bm25_search – Query the index with configurable top-k
- bm25_index_stats – Get index metadata (doc count, vocab size)
Works with Claude, ChatGPT, and any MCP-compatible agent framework.
Python Bindings
from bm25_turbo_python import BM25
# Build an index
engine = BM25(method="lucene", k1=1.5, b=0.75)
engine.index(["Rust is fast", "Python is flexible", "BM25 ranks documents"])
# Search: returns a (doc_ids, scores) tuple
doc_ids, scores = engine.search("fast programming", k=5)
for doc_id, score in zip(doc_ids, scores):
print(f" doc {doc_id}: {score:.4f}")
# Save / load
engine.save("my_index.bm25")
engine = BM25.load("my_index.bm25")
WASM (Browser / Edge)
import init, { WasmBM25 } from 'bm25-turbo-wasm';
await init();
const index = new WasmBM25(
["JavaScript runs everywhere",
"WebAssembly enables near-native performance",
"BM25 is a proven ranking algorithm"],
"lucene", // method (optional)
1.5, // k1 (optional)
0.75 // b (optional)
);
const results = index.search("native performance", 5);
console.log(results); // [{doc_id: 1, score: 0.82}, ...]
Bundle size: ~1.3 MB (gzipped: ~500 KB). No server needed: runs entirely in the browser.
HuggingFace Hub
Share and discover BM25 indexes on the HuggingFace Hub:
# Push your index
bm25-turbo push --index msmarco.bm25 --repo username/msmarco-bm25-turbo
# Pull someone else's index
bm25-turbo pull --repo username/msmarco-bm25-turbo --output msmarco.bm25
Pre-built indexes for common datasets (MS MARCO, NQ, SciFact, FiQA) can be shared across teams without re-indexing.
Installation
Rust
[dependencies]
bm25-turbo = "0.1"
CLI
cargo install bm25-turbo-cli
Python
pip install bm25-turbo
Requires Python 3.9-3.13. Pre-built wheels for Linux (x86_64, aarch64), macOS (x86_64, aarch64), and Windows (x86_64).
WASM / npm
npm install bm25-turbo-wasm
Architecture
┌──────────────────────────────────────────────────────────┐
│                        BM25 Turbo                        │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌──────────┐  ┌───────────┐  ┌──────────┐  ┌─────────┐  │
│  │ Tokenizer│  │  Scoring  │  │   CSC    │  │   BMW   │  │
│  │ 17 langs │  │ 5 variants│  │  Matrix  │  │ Pruning │  │
│  └─────┬────┘  └─────┬─────┘  └─────┬────┘  └────┬────┘  │
│        │             │              │            │       │
│  ┌─────┴─────────────┴──────────────┴────────────┴────┐  │
│  │             Index Builder / Streaming              │  │
│  └──────────────────────────┬──────────────────────────┘ │
│                             │                            │
│  ┌──────────────────────────┴──────────────────────────┐ │
│  │                  Persistence Layer                  │ │
│  │         Binary · Memory-Mapped · WAL · Hub          │ │
│  └─────────────────────────────────────────────────────┘ │
│                                                          │
├──────────┬──────────┬──────────┬────────────┬────────────┤
│   CLI    │   HTTP   │   MCP    │   Python   │    WASM    │
│          │  Server  │  Server  │  Bindings  │   (npm)    │
└──────────┴──────────┴──────────┴────────────┴────────────┘
How It Works
- Tokenize – Split documents into stemmed tokens with configurable stopword removal
- Score – Compute BM25 scores for every (term, document) pair using the chosen variant
- Compress – Store scores in a Compressed Sparse Column (CSC) matrix (only non-zero entries)
- Serve – Queries are sparse vector dot products: look up columns for query terms, accumulate scores, return top-k
The CSC format stores only non-zero BM25 scores. For a corpus of 8.8M documents with 500K vocabulary, this typically compresses to 2-4 GB, far smaller than a dense matrix (which would be ~17 TB).
Configuration
BM25 Parameters
| Parameter | Default | Range | Effect |
|---|---|---|---|
| k1 | 1.5 | 0.0-3.0 | Term frequency saturation. Higher = more weight to repeated terms |
| b | 0.75 | 0.0-1.0 | Length normalization. 0 = no normalization, 1 = full normalization |
| delta | 0.5 | 0.0-inf | BM25L/BM25+ lower bound (only used with those variants) |
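For orientation, these parameters enter the classic Robertson/Okapi scoring core (the other variants adjust the IDF term and add delta around this):

\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}

where f(t, d) is the frequency of term t in document d, |d| is the document length, and avgdl is the average document length over the corpus. BM25L and BM25+ add delta as a lower bound so that matching terms always contribute.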
Choosing a Variant
- Start with Lucene – it's the most widely used and tested variant
- Use Robertson if you need exact comparisons with academic papers
- Use BM25L if your corpus has extreme document length variation
- Use BM25+ if short queries return too many irrelevant results
- Use ATIRE if you're reproducing ATIRE research results
Benchmarks
Reproducing Our Numbers
# Clone and build
git clone https://github.com/TheSauceSuite/BM25-Turbo-Rust-Python-WASM-CLI-
cd BM25-Turbo-Rust-Python-WASM-CLI-
cargo build --release -p bm25-turbo-bench
# Run on SciFact (quick, 5K docs)
cargo run -p bm25-turbo-bench --release --bin beir_bench -- --datasets scifact
# Run on MS MARCO (full, 8.8M docs; requires ~16GB RAM)
cargo run -p bm25-turbo-bench --release --bin beir_bench -- --datasets msmarco --max-queries 1000
Datasets are automatically downloaded from the BEIR benchmark suite. First run may take a few minutes to download.
BEIR Benchmark Results
| Dataset | Documents | Vocab | Index Time | QPS | P50 Latency | nDCG@10 |
|---|---|---|---|---|---|---|
| SciFact | 5,183 | 19,927 | 223 ms | 682,127 | 67 μs | 0.665 |
| FiQA | 57,638 | 67,893 | 1.8 s | 50,812 | 711 μs | 0.254 |
| MS MARCO | 8,841,823 | ~500K | 66 min | 28,217 | 8.6 ms | – |
MS MARCO nDCG measured on the dev set (6,980 queries). QPS and latency measured on a random sample of 1,000 queries. All benchmarks single-threaded on a consumer desktop.
Use Cases
RAG (Retrieval-Augmented Generation)
BM25 Turbo is ideal as the retrieval stage in RAG pipelines. Index your knowledge base once, then retrieve relevant context for every LLM query:
let results = index.search(&user_question, 5)?;
let context = results.doc_ids.iter()
.map(|id| documents[*id as usize].as_str())
.collect::<Vec<_>>()
.join("\n\n");
// Feed context to your LLM
Hybrid Search (BM25 + Embeddings)
Combine BM25 lexical scores with dense embedding similarity for best-of-both-worlds retrieval:
// BM25 lexical retrieval
let bm25_results = index.search(query, 100)?;
// Dense retrieval (from your embedding model)
let dense_results = embedding_index.search(query_embedding, 100)?;
// Reciprocal Rank Fusion (rrf_merge is your own helper; see the sketch below)
let fused = rrf_merge(&bm25_results.doc_ids, &dense_results.doc_ids, 60.0);
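A minimal Reciprocal Rank Fusion helper along those lines. This is a sketch, assuming each result list is a ranked slice of doc ids (e.g. results.doc_ids):

use std::collections::HashMap;

// Minimal Reciprocal Rank Fusion: each list contributes
// 1 / (k + rank) for every document it returns, so documents
// ranked highly by both lists float to the top.
fn rrf_merge(bm25_ids: &[u32], dense_ids: &[u32], k: f32) -> Vec<(u32, f32)> {
    let mut fused: HashMap<u32, f32> = HashMap::new();
    for ids in [bm25_ids, dense_ids] {
        for (rank, &doc_id) in ids.iter().enumerate() {
            *fused.entry(doc_id).or_insert(0.0) += 1.0 / (k + rank as f32 + 1.0);
        }
    }
    let mut ranked: Vec<(u32, f32)> = fused.into_iter().collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    ranked
}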
Batch Evaluation
Score hundreds of thousands of queries against a corpus for ML experiments:
for query in &evaluation_queries {
let results = index.search(query, 10)?;
// Compute nDCG, MAP, recall...
}
// At 28K QPS, 500K queries finish in ~18 seconds
Comparison with Other Tools
BM25 Turbo is not a full-text search engine. It's a focused BM25 scoring library. If you need phrase queries, faceted search, or query DSLs, use Tantivy or Elasticsearch. If you need the fastest possible BM25 scores with the simplest possible API, or you're migrating from bm25s and need 2,000x more speed, use BM25 Turbo.
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
# Run tests
cargo test --workspace --exclude bm25-turbo-wasm
# Run clippy
cargo clippy --workspace --exclude bm25-turbo-wasm -- -D warnings
# Build WASM
cd bm25-turbo-wasm && wasm-pack build --target web
License
BM25 Turbo is dual-licensed:
Open Source β AGPL v3
This software is licensed under the GNU Affero General Public License v3.0. You are free to use, modify, and distribute this software under the terms of the AGPL. If you deploy a modified version as a network service, you must make the complete source code of your modified version available to users of that service under the AGPL.
Commercial License
For companies and individuals who cannot comply with the AGPL (e.g., you want to use BM25 Turbo in proprietary software without open-sourcing your code), a commercial license is available.
Purchase a commercial license →
The commercial license removes all AGPL copyleft obligations and includes:
- Use in proprietary/closed-source applications
- No requirement to disclose your source code
- Priority support
- Custom integration assistance
For licensing inquiries: alessandrobenigni.com
