# compress-tokens
An MCP server that compresses text by removing low-information tokens using local LLM surprisal scoring via candle. No API keys. No cloud. Everything runs on your machine.
## How it works
Each token in the input is scored by its surprisal: how unexpected it is given all preceding tokens, as computed by a local quantized LLM. Tokens with low surprisal (predictable filler) are dropped; tokens with high surprisal (informative content) are kept. The remaining tokens are decoded back to text.
The primary use case is reducing context window usage: Claude Code can call `compress_file` on a large file and get back a shorter version that preserves the information-dense parts before reasoning over it.
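
A minimal sketch of that pruning step in Rust, assuming per-token probabilities have already been produced by the local LM (here they are hard-coded stand-ins), with `keep_ratio` mirroring the tool parameter:

```rust
/// Keep the most surprising fraction of tokens, preserving input order.
/// Illustrative only; not the server's actual code.
fn compress(tokens: &[&str], probs: &[f64], keep_ratio: f64) -> Vec<String> {
    // Surprisal: -log2 p(token | preceding tokens). Low = predictable filler.
    let surprisal: Vec<f64> = probs.iter().map(|p| -p.log2()).collect();

    // Rank token indices by surprisal, highest (most informative) first.
    let mut order: Vec<usize> = (0..tokens.len()).collect();
    order.sort_by(|&a, &b| surprisal[b].partial_cmp(&surprisal[a]).unwrap());

    // Keep the top keep_ratio fraction, then restore the original order.
    let n_keep = ((tokens.len() as f64) * keep_ratio).ceil() as usize;
    let mut kept: Vec<usize> = order.into_iter().take(n_keep).collect();
    kept.sort();
    kept.into_iter().map(|i| tokens[i].to_string()).collect()
}

fn main() {
    let tokens = ["the", "reactor", "is", "overheating"];
    let probs = [0.9, 0.02, 0.7, 0.05]; // hypothetical LM probabilities
    println!("{:?}", compress(&tokens, &probs, 0.5)); // ["reactor", "overheating"]
}
```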
## Tools
| Tool | Description |
|---|---|
| `compress_text` | Compress text with an explicit `keep_ratio` (fraction of tokens to keep, default 0.7) |
| `compress_text_auto` | Compress text with an automatic keep ratio via elbow detection on the surprisal curve (sketched below) |
| `compress_file` | Read a file, compress it, and return the result. Optionally write to `output_path`. Large files are chunked at 2048 tokens. |
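
One standard way to implement elbow detection like `compress_text_auto`'s is a maximum-distance-from-chord heuristic on the sorted surprisal curve. The server's actual detector is not documented here, so treat this as a sketch of the general technique:

```rust
/// Hypothetical elbow finder. With surprisals sorted descending, the curve
/// drops steeply over informative tokens and flattens over filler; the elbow
/// is the index that falls farthest below the chord joining the endpoints.
/// Returns how many top-ranked tokens to keep (keep ratio = result / n).
fn elbow_keep_count(mut surprisal: Vec<f64>) -> usize {
    surprisal.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let n = surprisal.len();
    if n < 3 {
        return n; // too short to have an elbow; keep everything
    }
    let (first, last) = (surprisal[0], surprisal[n - 1]);
    let mut best = (n, 0.0_f64);
    for (i, &v) in surprisal.iter().enumerate() {
        // Chord value at position i, linearly interpolated between endpoints.
        let chord = first + (last - first) * i as f64 / (n - 1) as f64;
        let gap = chord - v;
        if gap > best.1 {
            best = (i, gap);
        }
    }
    best.0
}
```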
## Installation

### Prerequisites

- Rust toolchain (`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`)
- Claude Code
### Build

```bash
git clone https://github.com/Amir-Zecharia/compress-tokens
cd compress-tokens
cargo build --release
```
On macOS, enable Metal GPU acceleration:

```bash
cargo build --release --features metal
```
### Register with Claude Code

```bash
claude mcp add compress-tokens /path/to/compress-tokens/target/release/compress-tokens --scope user
```
On first use, the server downloads the default model (~700 MB) from HuggingFace and caches it locally. All subsequent starts load from cache.
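
candle-based servers usually fetch GGUF files through the `hf-hub` crate, which downloads into the shared HuggingFace cache. Whether this server does exactly the following is an assumption, and the repo/file names are copied from the env-var example below for illustration:

```rust
use hf_hub::api::sync::Api;

/// Sketch of the usual hf-hub download-and-cache pattern.
fn fetch_model() -> anyhow::Result<std::path::PathBuf> {
    let api = Api::new()?;
    let repo = api.model("bartowski/Llama-3.2-3B-Instruct-GGUF".to_string());
    // The first call downloads into the HF cache; later calls return the
    // cached path immediately, which is why restarts take only seconds.
    let path = repo.get("Llama-3.2-3B-Instruct-Q4_K_M.gguf")?;
    Ok(path)
}
```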
## Memory usage
The server loads the model at startup and exits automatically after 60 seconds of inactivity, freeing all memory.
| State | RAM |
|---|---|
| Idle (no active requests for 60s) | 0 MB (process has exited) |
| Active | ~700 MB (default model) |
When a new request arrives after the server has exited, Claude Code restarts it automatically. Startup from local cache takes a few seconds.
The idle timeout can be changed or disabled with `--idle-timeout`:

```bash
# Exit after 30 seconds idle
claude mcp add compress-tokens "/path/to/compress-tokens --idle-timeout 30" --scope user

# Never exit (keep model in memory permanently)
claude mcp add compress-tokens "/path/to/compress-tokens --idle-timeout 0" --scope user
```
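
An idle timeout like this is typically a watchdog thread: each request stamps a last-activity time, and a background loop exits the process once the timeout elapses with no new stamps. A minimal sketch, not the server's actual implementation:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Unix-seconds timestamp of the most recent request.
static LAST_ACTIVITY: AtomicU64 = AtomicU64::new(0);

fn now_secs() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
}

/// Call on every incoming request to reset the idle clock.
fn touch() {
    LAST_ACTIVITY.store(now_secs(), Ordering::Relaxed);
}

/// Exit the process after `timeout_secs` with no activity.
fn spawn_watchdog(timeout_secs: u64) {
    if timeout_secs == 0 {
        return; // mirrors --idle-timeout 0: keep the model resident forever
    }
    touch();
    std::thread::spawn(move || loop {
        std::thread::sleep(Duration::from_secs(1));
        if now_secs() - LAST_ACTIVITY.load(Ordering::Relaxed) >= timeout_secs {
            std::process::exit(0); // frees the model's memory immediately
        }
    });
}
```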
## Model presets

Choose a model with `--preset`. Smaller models use less RAM and start faster; larger models produce better surprisal scores.
| Preset | Model | RAM | Notes |
|---|---|---|---|
| `smollm` | SmolLM2-360M | ~250 MB | Fastest, lowest RAM |
| `llama1b` | Llama-3.2-1B | ~700 MB | Default |
| `llama3b` | Llama-3.2-3B | ~2 GB | Best scoring quality |

```bash
claude mcp add compress-tokens "/path/to/compress-tokens --preset smollm" --scope user
```
Or set via `~/.claude/settings.json`:

```json
{
  "mcpServers": {
    "compress-tokens": {
      "command": "/path/to/compress-tokens",
      "args": ["--preset", "smollm", "--idle-timeout", "30"]
    }
  }
}
```
### Use a local GGUF file

```bash
claude mcp add compress-tokens "/path/to/compress-tokens --model /path/to/model.gguf" --scope user
```
Any LLaMA-architecture GGUF model works. `--model` takes priority over `--preset` and env vars.
### Override via environment variables

```json
{
  "mcpServers": {
    "compress-tokens": {
      "command": "/path/to/compress-tokens",
      "env": {
        "COMPRESS_MODEL_REPO": "bartowski/Llama-3.2-3B-Instruct-GGUF",
        "COMPRESS_MODEL_FILE": "Llama-3.2-3B-Instruct-Q4_K_M.gguf"
      }
    }
  }
}
```
`COMPRESS_MODEL_REPO` is validated against a built-in allowlist of trusted HuggingFace repos. Use `--model` with a local path if you need a custom model.

Priority order: `--model` > env vars > `--preset`.
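
That priority translates into a resolution function along these lines (a sketch; the function name and return convention are assumptions, not the server's internals):

```rust
/// Resolve which model to load, mirroring the documented priority:
/// --model > COMPRESS_MODEL_REPO/FILE env vars > --preset.
fn resolve_model(cli_model: Option<String>, preset: Option<String>) -> String {
    if let Some(path) = cli_model {
        return path; // 1. --model: an explicit local GGUF path always wins
    }
    if let (Ok(repo), Ok(file)) = (
        std::env::var("COMPRESS_MODEL_REPO"),
        std::env::var("COMPRESS_MODEL_FILE"),
    ) {
        return format!("{repo}/{file}"); // 2. env vars, checked against the allowlist
    }
    preset.unwrap_or_else(|| "llama1b".to_string()) // 3. --preset, default llama1b
}
```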
## Security flags

| Flag | Description |
|---|---|
| `--no-file-access` | Disable the `compress_file` tool entirely. Use this when you want to restrict the server to in-memory text compression only. |

```bash
claude mcp add compress-tokens "/path/to/compress-tokens --no-file-access" --scope user
```
## Architecture
- Startup loading: model loads at startup; process exits after the idle timeout, so RAM is zero when not in use
- stdout discipline: only valid JSON-RPC 2.0 is ever written to stdout; all logging goes to stderr
- No external APIs: all inference runs locally via candle (CPU or Metal)
- Input limits: text inputs capped at 1 MB; stdin lines capped at 10 MB
- Chunked file compression: files over 2048 tokens are split into chunks, each compressed independently (see the sketch after this list)
- Supply chain integrity: tokenizer is SHA-256 verified at build time and embedded in the binary
- Protocol: MCP 2024-11-05 over stdio
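
The chunked compression mentioned above is easy to picture. Here is a hypothetical version of the loop, where `compress_chunk` stands in for the per-chunk surprisal pass sketched under How it works:

```rust
/// Hypothetical shape of compress_file's chunking: split the token stream
/// into 2048-token windows, compress each independently, and rejoin.
const CHUNK_TOKENS: usize = 2048;

fn compress_all(tokens: &[u32], keep_ratio: f64) -> Vec<u32> {
    tokens
        .chunks(CHUNK_TOKENS)
        .flat_map(|chunk| compress_chunk(chunk, keep_ratio))
        .collect()
}

// Placeholder for the real surprisal-based scoring; see "How it works".
fn compress_chunk(chunk: &[u32], keep_ratio: f64) -> Vec<u32> {
    let n_keep = ((chunk.len() as f64) * keep_ratio).ceil() as usize;
    chunk.iter().copied().take(n_keep).collect()
}
```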
## License
MIT
