Sophon
Honest token economics for MCP agents. Rust binary, zero ML, reproducible benchmarks.
Deterministic context compression for MCP agents. One Rust binary. Zero ML at query time. Reproducible benchmarks, real-data measurements.
Sophon is a deterministic context layer for agents speaking the Model Context Protocol. It compresses prompts, conversation memory, code digests, file deltas, and shell output, all without an embedding model at query time, without a GPU, and without API keys.
Single 5.2 MB Rust binary. MCP-native. cl100k_base-accurate. Default build pulls no Python, no ML weights, no network.
What it does, in 30 seconds
| Tool | What it solves |
|---|---|
| `compress_prompt` | Long structured prompt → keep only sections relevant to the query |
| `compress_history` | Growing conversation → summary + facts + recent window + optional retrieval |
| `compress_output` | Shell stdout/stderr → 21 domain-aware filters (git, cargo, docker, kubectl, JSON, …) |
| `read_file_delta` / `write_file_delta` | Re-reads + edits → diffs only, never the whole file |
| `encode_fragments` | Repeated boilerplate → single token reference |
| `update_memory` | Append turn → JSONL persist + incremental rolling summary |
| `navigate_codebase` | Repo digest with tree-sitter / regex + PageRank, ranked by query |
11 MCP tools total (full table below).
Real numbers: measured on this repo's own dev cycle
We built four independent benches that each capture a different chunk of an agent's tool traffic. All four run against this repo's actual git history + working tree on the operator's machine. Reproducible byte-for-byte by anyone with cargo build --release.
| Dimension | What it measures | Saved | Bench |
|---|---|---|---|
| history | compress_history over real commits | 94.6 % | real_session_capture.py |
| shell | compress_output on real git/cargo/gh/ls stdout | 84.4 % | real_session_shell.py |
| filereads | compress_prompt on real Rust / Python / Markdown / TOML files | 71.7 % | real_session_filereads.py |
| search | compress_output on real grep/find patterns | 79.5 % | real_session_search.py |
| 🎯 Weighted blend (35/30/20/15) | typical agent session estimate | 84.7 % | real_session_holistic.py |
real_session_holistic.py runs all four sub-benches with --json, parses them, and produces the weighted blend. Default weights reflect this repo's observed shape; pass --weights "history=0.4,..." to model your own workload.
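The blend is a plain weighted average, so the headline figure can be checked by hand. Below is a minimal sketch using the four measured percentages and the default 35/30/20/15 weights from the table. The `parse_weights` helper is hypothetical: it only mirrors the `history=0.4,...` shape shown for `--weights`. Normalising by the weight total is also an assumption, not something taken from the bench script.

```python
# Measured savings (%) per dimension, copied from the table above.
SAVED = {"history": 94.6, "shell": 84.4, "filereads": 71.7, "search": 79.5}

# Default weights 35/30/20/15, in the same order as the table rows.
DEFAULT_WEIGHTS = {"history": 0.35, "shell": 0.30, "filereads": 0.20, "search": 0.15}


def parse_weights(spec: str) -> dict[str, float]:
    """Parse a 'name=value,name=value' string (hypothetical helper)."""
    weights = {}
    for pair in spec.split(","):
        name, value = pair.split("=")
        weights[name.strip()] = float(value)
    return weights


def blend(saved: dict[str, float], weights: dict[str, float]) -> float:
    """Linear weighted average; divides by the weight total (an assumption)."""
    total = sum(weights.values())
    return sum(saved[name] * w for name, w in weights.items()) / total


if __name__ == "__main__":
    print(f"default blend: {blend(SAVED, DEFAULT_WEIGHTS):.1f} %")
    custom = parse_weights("history=0.4,shell=0.3,filereads=0.2,search=0.1")
    print(f"custom blend:  {blend(SAVED, custom):.1f} %")
```

With the default weights the sum is 33.11 + 25.32 + 14.34 + 11.925 = 84.695, which rounds to the 84.7 % shown in the table.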
USD economy on Claude Opus 4.7
| Saved per session | |
|---|---|
| Naive input pricing ($15/MT) | $2.03 |
| With prompt caching (25-turn reads at $1.50/MT) | $3.24 |
Pass `--model sonnet` or `--model haiku` to `real_session_deep_dive.py` if you're re-pricing for a cheaper tier.
Where each dimension falls short (we say it ourselves)
- history measures only what `git` captures (commits + diffs), typically ~5-10 % of a real session's tool traffic. The 94.6 % is the upper bound, not the typical case.
- shell mixes commands that compress well (`git diff` 95 %) with commands that don't (`gh repo view --json` adds tokens, −9 %). 84.4 % is a real-world average, not a curated highlight.
- filereads uncovered that `compress_prompt` on raw source files compresses by budget cap, not by query routing: the same file with 3 different queries → identical output. Section detection only fires on structured input (Markdown headers, XML tags). Documented inline in the bench.
- search depends entirely on YOUR repo's state. A repo with no TODOs gets 0 % on `grep TODO`.
The blended 84.7 % is napkin-math from a linear weighted average across four real measurements. Not a cherry-picked synthetic. Run the benches yourself to verify.
Other reproducible benchmarks (synthetic, on-thesis)
| Test | Result | Bench |
|---|---|---|
| `compress_output` across 18 command families | 90.1 % weighted aggregate | compress_output_per_command.py |
| 25-turn synthetic Claude Code session | 68.1 % session tokens saved | session_token_economics.py |
| `compress_prompt` across 22 prompt shapes | 70.2 % mean, 36 ms mean latency | prompt_compression_extended.py |
| Code retrieval on "where is X?" questions | recall@3 = 70 % (vs grep 10 %, FULL 20 %) | repo_qa.py |
| vs LLMLingua-2 on structured prompts | +8.9 pt accuracy at 35× lower latency | llmlingua_compare.py |
| Sophon + Anthropic prompt caching | +24 % tokens / +49 % $ on top of caching | sophon_plus_prompt_caching.py |
| Sophon + mem0 | Additional savings on retrieved memories | sophon_plus_mem0.py |
Why Sophon: "in front of X"
Sophon is not a memory platform, a recall system, an OCR stack, or a replacement for provider-side caching. It's a deterministic compressor that slots in front of whatever memory / cache / code-nav layer you already use, and attacks the tokens those layers can't.
In front of Anthropic / OpenAI prompt caching
Provider caching handles the static half of a request: system prompt, tool definitions, reused documents. It doesn't touch the dynamic half (growing conversation history, tool outputs). Sophon compresses exactly that half. The two stack cleanly.
+24 % tokens / +49 % $ saved on top of prompt caching on a 25-turn Claude session, because the uncached dynamic block is billed at 10× the cached rate. See `sophon_plus_prompt_caching.py`.
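Why the dollar share is larger than the token share follows from the 10× rate gap alone. Here is a toy cost model, a sketch and not the bench's method. The token counts and the 50 % compression ratio are illustrative assumptions and do not reproduce the +24 % / +49 % figures.

```python
def savings(static_tokens: int, dynamic_tokens: int, compression: float,
            cached_rate: float = 1.0, uncached_multiplier: float = 10.0):
    """Return (token_share_saved, dollar_share_saved) for one request.

    static_tokens  -- cached prefix (system prompt, tool definitions)
    dynamic_tokens -- uncached tail (history, tool outputs)
    compression    -- fraction of the dynamic tail that is removed
    """
    uncached_rate = cached_rate * uncached_multiplier
    removed = dynamic_tokens * compression

    tokens_before = static_tokens + dynamic_tokens
    cost_before = static_tokens * cached_rate + dynamic_tokens * uncached_rate

    # Every removed token came from the expensive (uncached) side.
    return removed / tokens_before, (removed * uncached_rate) / cost_before


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the bench.
    tok, usd = savings(static_tokens=8_000, dynamic_tokens=4_000, compression=0.5)
    print(f"tokens saved: {tok:.1%}, dollars saved: {usd:.1%}")
```

In this toy case one sixth of the tokens disappear but five twelfths of the bill does, because the cached prefix contributes many tokens and little cost.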
In front of mem0 / Letta / Zep / Graphiti
Memory systems retrieve the right memories. Sophon shrinks what gets sent to the LLM after retrieval. If mem0 returns 2 kB of raw memories, compress_prompt keeps only the sections the query actually references.
Honest caveat: on very short retrieved blocks (< ~200 tokens) Sophon's wrapper adds overhead and you should pass through. The bench reports this directly.
In front of Claude Code / Cursor / Cline
Primary use case. Every repeat file read becomes a read_file_delta; every shell command output goes through compress_output; every repeated boilerplate block gets a fragment_cache token. Install transparently with sophon hook install --agent claude --global.
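To make the delta idea concrete, here is a minimal sketch of a hash-aware reader built on the standard library. The `DeltaReader` class and its payload shape (`kind`, `hash`, `text`) are hypothetical illustrations, not Sophon's wire format: the first read returns the full text, an unchanged re-read returns only a marker, and a changed re-read returns a unified diff.

```python
import difflib
import hashlib


class DeltaReader:
    """Toy hash-aware reader: full text first, then diffs or a no-op marker."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}  # path -> content returned last time

    def read(self, path: str, content: str) -> dict:
        previous = self._seen.get(path)
        self._seen[path] = content
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()

        if previous is None:
            return {"kind": "full", "hash": digest, "text": content}
        if previous == content:
            return {"kind": "unchanged", "hash": digest}

        diff = difflib.unified_diff(
            previous.splitlines(), content.splitlines(),
            fromfile=path, tofile=path, lineterm="",
        )
        return {"kind": "delta", "hash": digest, "text": "\n".join(diff)}


if __name__ == "__main__":
    reader = DeltaReader()
    print(reader.read("main.rs", "fn main() {}\n")["kind"])
    print(reader.read("main.rs", "fn main() {}\n")["kind"])
    print(reader.read("main.rs", "fn main() { run(); }\n")["text"])
```

The content is passed in rather than read from disk so the sketch stays self-contained; a real reader would load the file and key its cache on the path.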
In front of a RAG pipeline
navigate_codebase produces a PageRanked repo digest that a RAG retriever would otherwise spend expensive embedding calls to build. Tree-sitter / regex symbol extraction over 11 languages, sub-second.
When NOT to use Sophon
- Long-form conversational recall above 80 %: Sophon caps at ~40 % on LOCOMO and we don't chase it. Run mem0 / Letta / Zep for recall, then optionally pipe their output through Sophon.
- Multi-hop reasoning on massive documents: that's HippoRAG or GraphRAG.
- OCR / PDF layout: out of scope. Use Docling / Marker / Unstructured upstream.
- Very small inputs (< ~200 tokens): Sophon's section scaffolding can cost more than it saves.
Quick start
Install via npm (recommended)
npm install -g mcp-sophon
sophon doctor # verify install + show config
The postinstall script downloads the right prebuilt binary for your platform from the GitHub Releases page. Supported: macOS arm64/x64, Linux arm64/x64, Windows x64.
Build from source
git clone https://github.com/lacausecrypto/mcp-sophon
cd mcp-sophon/sophon
cargo build --release -p mcp-integration # ~5.2 MB binary
Optional features:
# 11-language tree-sitter AST extraction (~25 MB):
cargo build --release -p mcp-integration --features codebase-navigator/tree-sitter
# BGE-small semantic embedder (~34 MB), activate with SOPHON_EMBEDDER=bge:
cargo build --release -p mcp-integration --features bge
# All features (~42 MB):
cargo build --release -p mcp-integration --features "codebase-navigator/tree-sitter,bge"
Requires Rust 1.75+.
Wire it into an MCP client
Most clients accept this snippet (Claude Desktop, Claude Code, Cursor, Cline, Continue):
{
  "mcpServers": {
    "sophon": {
      "command": "sophon",
      "args": ["serve"]
    }
  }
}
Run sophon doctor to print the right config path for your client.
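If the client config already lists other servers, the `sophon` entry has to be merged in rather than pasted over the file. A minimal sketch of that merge on an in-memory dict follows; the `other-server` entry is a hypothetical placeholder, and reading or writing the actual config file is left out because its path depends on the client.

```python
import json


def add_sophon_server(config: dict) -> dict:
    """Return a copy of an MCP client config with the sophon entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["sophon"] = {"command": "sophon", "args": ["serve"]}
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    # Hypothetical existing config with one unrelated server.
    existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
    print(json.dumps(add_sophon_server(existing), indent=2))
```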
Recommended runtime setup
# Persistent memory + on-disk retriever store + BM25+Hash hybrid
export SOPHON_MEMORY_PATH=~/.sophon/memory.jsonl
export SOPHON_RETRIEVER_PATH=~/.sophon/retriever
export SOPHON_HYBRID=1
sophon serve
Quick CLI
sophon exec -- cargo test # run + compress combined output
sophon compress-prompt --prompt ./system.txt --query "rust errors" --max-tokens 500
sophon hook install --agent claude --global # transparent Claude Code integration
sophon stats --period session # token savings rollup
Programmatic (one-shot JSON-RPC)
echo '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"compress_prompt","arguments":{"prompt":"<rust>?: operator</rust><web>fetch()</web>","query":"rust errors","max_tokens":500}}}' \
| sophon serve
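The same one-shot call can be driven from Python. This is a sketch that assumes `sophon serve` answers a single `tools/call` line on stdin exactly as in the shell example above. The helper names `build_tool_call` and `call_sophon` are hypothetical, and the response is returned as raw text because its shape is not documented here.

```python
import json
import shutil
import subprocess


def build_tool_call(name: str, arguments: dict, request_id: int = 1) -> str:
    """Serialise a JSON-RPC 2.0 tools/call request as a single line."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


def call_sophon(line: str) -> str:
    """Pipe one request into `sophon serve` and return its raw stdout."""
    result = subprocess.run(
        ["sophon", "serve"], input=line + "\n",
        capture_output=True, text=True, check=True, timeout=30,
    )
    return result.stdout


if __name__ == "__main__":
    payload = build_tool_call(
        "compress_prompt",
        {
            "prompt": "<rust>?: operator</rust><web>fetch()</web>",
            "query": "rust errors",
            "max_tokens": 500,
        },
    )
    print(payload)
    if shutil.which("sophon"):  # only run when the binary is on PATH
        print(call_sophon(payload))
```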
What the binary ships
11 MCP tools, all stdio:
| Tool | What it does |
|---|---|
| `compress_prompt` | Keep query-relevant sections of a long prompt |
| `compress_history` | Summary + facts + recent + optional retrieval over the conversation |
| `compress_output` | Strip noise from command stdout/stderr (21 domain filters + JsonStructural) |
| `navigate_codebase` | tree-sitter / regex digest of a repo, PageRanked by query |
| `update_memory` | Append messages, JSONL persist, optional rolling summary |
| `read_file_delta` | Version/hash-aware file read, unchanged → minimal payload |
| `write_file_delta` | Send edits as diffs, not full files |
| `encode_fragments` / `decode_fragments` | Detect repeated boilerplate, swap with tokens |
| `count_tokens` | cl100k_base-accurate token count |
| `get_token_stats` | Session-level savings rollup |
Binary sizes by feature set:
| Build | Size |
|---|---|
| Default (regex extractors, HashEmbedder) | 5.2 MB |
| + tree-sitter (11 languages) | ~25 MB |
| + BGE semantic embedder | ~34 MB |
| All features | ~42 MB |
MCP protocol: 2025-06-18. `notifications/cancelled` actually drops the response (since v0.5.4). Structured JSON-RPC error codes (-32000..-32099 reserved for Sophon). Infallible dispatcher: a malformed request can't kill the stdio loop.
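On the client side, the reserved range makes it easy to tell protocol-level failures from Sophon-specific ones. A small sketch follows: the five named codes are the standard JSON-RPC 2.0 ones, while the meaning of individual codes inside -32000..-32099 is not documented here, so they are only grouped and not named. The `classify_error` helper is hypothetical.

```python
# Pre-defined error codes from the JSON-RPC 2.0 specification.
STANDARD_ERRORS = {
    -32700: "parse error",
    -32600: "invalid request",
    -32601: "method not found",
    -32602: "invalid params",
    -32603: "internal error",
}


def classify_error(code: int) -> str:
    """Name the family a JSON-RPC error code belongs to."""
    if code in STANDARD_ERRORS:
        return f"json-rpc: {STANDARD_ERRORS[code]}"
    if -32099 <= code <= -32000:
        # Server-defined range, which this README reserves for Sophon.
        return "sophon: server-defined error"
    return "unknown"


if __name__ == "__main__":
    for code in (-32700, -32601, -32000, -32050, 42):
        print(code, classify_error(code))
```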
Configuration
Run sophon doctor to see every SOPHON_* env var currently set with validation warnings. Full catalogue (24 flags) lives in runtime_flags.rs. The flags worth knowing:
| Flag | Effect | Cost |
|---|---|---|
| `SOPHON_RETRIEVER_PATH=/dir` | Activate the semantic retriever (chunk store on disk) | ~0 |
| `SOPHON_MEMORY_PATH=/file.jsonl` | Persistent conversation memory across `sophon serve` runs | ~0 |
| `SOPHON_HYBRID=1` | BM25 sparse-lexical + HashEmbedder fused via RRF | ~1 ms |
| `SOPHON_ROLLING_SUMMARY=1` | Build rolling summary at `update_memory` time, not at query time | LLM call moved to ingest |
| `SOPHON_CHUNK_TARGET=500` | Bigger chunks preserve cross-sentence context | ~0 |
| `SOPHON_EMBEDDER=bge` | Swap HashEmbedder for BGE-small (needs `--features bge`) | model load at startup |
| `SOPHON_LLM_CMD="claude -p --model haiku"` | LLM shell-out command (used by summarizer when configured) | per-call subprocess |
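For readers unfamiliar with the RRF named in the `SOPHON_HYBRID` row: reciprocal rank fusion merges ranked lists using only rank positions, never raw scores, which is why a BM25 list and an embedding list can be combined without calibration. Here is a generic sketch, not Sophon's implementation. The constant `k = 60` is a common default in the literature; the value Sophon uses is not stated here, and the chunk ids are hypothetical.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    bm25_ranking = ["c1", "c2", "c3"]  # hypothetical chunk ids, best first
    hash_ranking = ["c3", "c1", "c4"]
    for doc_id, score in rrf([bm25_ranking, hash_ranking]):
        print(doc_id, round(score, 5))
```

In this example `c1` comes first (ranks 1 and 2), then `c3` (ranks 3 and 1), then the two chunks that appear in only one list.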
The deprecated v0.4.0 recall-chasing flags (`SOPHON_HYDE`, `SOPHON_FACT_CARDS`, `SOPHON_ENTITY_GRAPH`, `SOPHON_ADAPTIVE`, `SOPHON_LLM_RERANK`, `SOPHON_TAIL_SUMMARY`, `SOPHON_REACT`, `SOPHON_GRAPH_MEMORY`, `SOPHON_MULTIHOP_LLM`) chase LOCOMO recall, an axis we no longer optimise. They are still functional, but `sophon doctor` flags them, and they will be removed in a future major release.
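`sophon doctor` stays the authoritative check, but a CI job or wrapper script may want to fail early on the deprecated names without invoking the binary. A minimal sketch that scans an environment mapping for the nine flags listed above follows; the `deprecated_in_use` helper is hypothetical and is not how `sophon doctor` is implemented.

```python
import os

# The nine deprecated v0.4.0 flags named in this README.
DEPRECATED_FLAGS = (
    "SOPHON_HYDE", "SOPHON_FACT_CARDS", "SOPHON_ENTITY_GRAPH",
    "SOPHON_ADAPTIVE", "SOPHON_LLM_RERANK", "SOPHON_TAIL_SUMMARY",
    "SOPHON_REACT", "SOPHON_GRAPH_MEMORY", "SOPHON_MULTIHOP_LLM",
)


def deprecated_in_use(env) -> list[str]:
    """Return the deprecated flags that are set in the given environment."""
    return [name for name in DEPRECATED_FLAGS if name in env]


if __name__ == "__main__":
    for name in deprecated_in_use(os.environ):
        print(f"warning: {name} is deprecated")
```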
Honest limitations
The full list lives in BENCHMARK.md § 8. Headlines:
- LOCOMO conversational recall caps at ~40 %. mem0 / HippoRAG hit 80-90 % with neural retrieval at query time; we chose determinism + sub-100 ms p99 instead. Pipe mem0 in front of Sophon if you need that recall.
- HashEmbedder is keyword-bound. "favorite food" and "weakness for ginger snaps" don't match. Activate BGE (`SOPHON_EMBEDDER=bge`) for semantic recall; it costs +25 MB binary + model load.
- No multimodal ingestion. Images / PDFs / audio out of scope. Run Docling / Marker / Unstructured upstream.
- Rolling summary doesn't help on small sessions. When the un-summarised tail fits the budget, the rolling cache is a no-op. Useful for long-running sessions with `SOPHON_LLM_CMD` set.
- Some commands don't compress. `gh repo view --json` adds tokens, `git log --oneline` saves 0.4 %. Sophon's job isn't to compress already-compact output; it's to compress redundant verbose output. The benches name the gaps explicitly.
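The keyword-bound limitation is easy to see with any hashing embedder. The sketch below is a generic sparse hashed bag-of-words, an assumption about the general technique and not Sophon's HashEmbedder. Two phrases that share no token get a cosine similarity of zero, however related they are in meaning.

```python
import hashlib
import math
from collections import Counter


def hash_embed(text: str) -> Counter:
    """Sparse bag-of-words vector keyed by a 64-bit hash of each token."""
    vector: Counter = Counter()
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vector[int.from_bytes(digest[:8], "big")] += 1
    return vector


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    query = hash_embed("favorite food")
    print(cosine(query, hash_embed("weakness for ginger snaps")))  # no shared token
    print(cosine(query, hash_embed("favorite snack")))             # one shared token
```

A learned embedder such as BGE places the two phrases near each other because it was trained on meaning, which a token hash never sees.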
Project layout
.
├── README.md            ← you are here
├── BENCHMARK.md         ← full per-section benchmark detail
├── CHANGELOG.md         ← version history + deprecated numbers
├── benchmarks/          ← reproducible scripts for every number above
├── npm/                 ← npm wrapper package
└── sophon/crates/       ← 11-crate Rust workspace
    ├── prompt-compressor/     compress_prompt
    ├── memory-manager/        compress_history, update_memory, rolling summary
    ├── delta-streamer/        read/write_file_delta
    ├── fragment-cache/        encode/decode_fragments
    ├── semantic-retriever/    chunker + HashEmbedder + BM25 + entity graph
    ├── output-compressor/     21 command-aware filters + JsonStructural
    ├── codebase-navigator/    tree-sitter / regex + PageRank
    ├── cli-hooks/             transparent agent installer
    └── mcp-integration/       stdio server, async dispatch, cancellation
Contributing
PRs welcome. Run the test suite:
cd sophon && cargo test --workspace --lib --tests --exclude prompt-compressor # 405 tests
cd sophon && cargo test --features codebase-navigator/tree-sitter # +AST tests
cd sophon-py && .venv/bin/pytest tests/ # 4 Python tests
Every benchmark claim is reproducible; pointers to the scripts live in BENCHMARK.md. If a number doesn't reproduce on your machine, open an issue.
Particularly welcome:
- TypeScript bindings (Python bindings ship in `sophon-py/`)
- `gh` family filter (`gh run list`, `gh pr list`, `gh repo view --json`): the bench shows this is currently a gap
- `SOPHON_EMBEDDER_CMD` shell-out plugin pattern (mirror of `SOPHON_LLM_CMD`) for Voyage / OpenAI / Cohere
- Multi-repo `real_session_holistic.py` runs against popular open-source repos
License
MIT. See LICENSE.
