# compress-tokens
An MCP server that compresses text by removing low-information tokens using local LLM surprisal scoring via candle. No API keys. No cloud. Everything runs on your machine.
## How it works
Each token in the input is scored by its surprisal: how unexpected it is given all preceding tokens, as computed by a local quantized LLM. Tokens with low surprisal (predictable filler) are dropped; tokens with high surprisal (informative content) are kept. The remaining tokens are decoded back to text.
The primary use case is reducing context window usage: Claude Code can call `compress_file` on a large file and get back a shorter version that preserves the information-dense parts before reasoning over it.
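
A minimal sketch of that pruning step in Rust, assuming per-token probabilities have already been produced by the local LM (here they are hard-coded stand-ins), with `keep_ratio` mirroring the tool parameter:

```rust
/// Keep the most surprising fraction of tokens, preserving input order.
/// Illustrative only; not the server's actual code.
fn compress(tokens: &[&str], probs: &[f64], keep_ratio: f64) -> Vec<String> {
    // Surprisal: -log2 p(token | preceding tokens). Low = predictable filler.
    let surprisal: Vec<f64> = probs.iter().map(|p| -p.log2()).collect();

    // Rank token indices by surprisal, highest (most informative) first.
    let mut order: Vec<usize> = (0..tokens.len()).collect();
    order.sort_by(|&a, &b| surprisal[b].partial_cmp(&surprisal[a]).unwrap());

    // Keep the top keep_ratio fraction, then restore the original order.
    let n_keep = ((tokens.len() as f64) * keep_ratio).ceil() as usize;
    let mut kept: Vec<usize> = order.into_iter().take(n_keep).collect();
    kept.sort();
    kept.into_iter().map(|i| tokens[i].to_string()).collect()
}

fn main() {
    let tokens = ["the", "reactor", "is", "overheating"];
    let probs = [0.9, 0.02, 0.7, 0.05]; // hypothetical LM probabilities
    println!("{:?}", compress(&tokens, &probs, 0.5)); // ["reactor", "overheating"]
}
```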
## Tools
| Tool | Description |
|---|---|
| `compress_text` | Compress text with an explicit `keep_ratio` (fraction of tokens to keep, default 0.7) |
| `compress_text_auto` | Compress text with an automatic keep ratio via elbow detection on the surprisal curve (sketched below) |
| `compress_file` | Read a file, compress it, and return the result. Optionally write to `output_path`. Large files are chunked at 2048 tokens. |
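
One standard way to implement elbow detection like `compress_text_auto`'s is a maximum-distance-from-chord heuristic on the sorted surprisal curve. The server's actual detector is not documented here, so treat this as a sketch of the general technique:

```rust
/// Hypothetical elbow finder. With surprisals sorted descending, the curve
/// drops steeply over informative tokens and flattens over filler; the elbow
/// is the index that falls farthest below the chord joining the endpoints.
/// Returns how many top-ranked tokens to keep (keep ratio = result / n).
fn elbow_keep_count(mut surprisal: Vec<f64>) -> usize {
    surprisal.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let n = surprisal.len();
    if n < 3 {
        return n; // too short to have an elbow; keep everything
    }
    let (first, last) = (surprisal[0], surprisal[n - 1]);
    let mut best = (n, 0.0_f64);
    for (i, &v) in surprisal.iter().enumerate() {
        // Chord value at position i, linearly interpolated between endpoints.
        let chord = first + (last - first) * i as f64 / (n - 1) as f64;
        let gap = chord - v;
        if gap > best.1 {
            best = (i, gap);
        }
    }
    best.0
}
```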
## Installation

### Prerequisites

- Rust toolchain (`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`)
- Claude Code
### Build

```bash
git clone https://github.com/Amir-Zecharia/compress-tokens
cd compress-tokens
cargo build --release
```
On macOS, enable Metal GPU acceleration:

```bash
cargo build --release --features metal
```
### Register with Claude Code

```bash
claude mcp add compress-tokens /path/to/compress-tokens/target/release/compress-tokens --scope user
```
On first use, the server downloads the default model (~700 MB) from HuggingFace and caches it locally. All subsequent starts load from cache.
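
candle-based servers usually fetch GGUF files through the `hf-hub` crate, which downloads into the shared HuggingFace cache. Whether this server does exactly the following is an assumption, and the repo/file names are copied from the env-var example below for illustration:

```rust
use hf_hub::api::sync::Api;

/// Sketch of the usual hf-hub download-and-cache pattern.
fn fetch_model() -> anyhow::Result<std::path::PathBuf> {
    let api = Api::new()?;
    let repo = api.model("bartowski/Llama-3.2-3B-Instruct-GGUF".to_string());
    // The first call downloads into the HF cache; later calls return the
    // cached path immediately, which is why restarts take only seconds.
    let path = repo.get("Llama-3.2-3B-Instruct-Q4_K_M.gguf")?;
    Ok(path)
}
```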
## Memory usage
The server loads the model at startup and exits automatically after 60 seconds of inactivity, freeing all memory.
| State | RAM |
|---|---|
| Idle (no active requests for 60s) | 0 MB (process has exited) |
| Active | ~700 MB (default model) |
When a new request arrives after the server has exited, Claude Code restarts it automatically. Startup from local cache takes a few seconds.
The idle timeout can be changed or disabled with `--idle-timeout`:

```bash
# Exit after 30 seconds idle
claude mcp add compress-tokens "/path/to/compress-tokens --idle-timeout 30" --scope user

# Never exit (keep model in memory permanently)
claude mcp add compress-tokens "/path/to/compress-tokens --idle-timeout 0" --scope user
```
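
An idle timeout like this is typically a watchdog thread: each request stamps a last-activity time, and a background loop exits the process once the timeout elapses with no new stamps. A minimal sketch, not the server's actual implementation:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Unix-seconds timestamp of the most recent request.
static LAST_ACTIVITY: AtomicU64 = AtomicU64::new(0);

fn now_secs() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
}

/// Call on every incoming request to reset the idle clock.
fn touch() {
    LAST_ACTIVITY.store(now_secs(), Ordering::Relaxed);
}

/// Exit the process after `timeout_secs` with no activity.
fn spawn_watchdog(timeout_secs: u64) {
    if timeout_secs == 0 {
        return; // mirrors --idle-timeout 0: keep the model resident forever
    }
    touch();
    std::thread::spawn(move || loop {
        std::thread::sleep(Duration::from_secs(1));
        if now_secs() - LAST_ACTIVITY.load(Ordering::Relaxed) >= timeout_secs {
            std::process::exit(0); // frees the model's memory immediately
        }
    });
}
```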
## Model presets

Choose a model with `--preset`. Smaller models use less RAM and start faster; larger models produce better surprisal scores.
| Preset | Model | RAM | Notes |
|---|---|---|---|
| `smollm` | SmolLM2-360M | ~250 MB | Fastest, lowest RAM |
| `llama1b` | Llama-3.2-1B | ~700 MB | Default |
| `llama3b` | Llama-3.2-3B | ~2 GB | Best scoring quality |

```bash
claude mcp add compress-tokens "/path/to/compress-tokens --preset smollm" --scope user
```
Or set via `~/.claude/settings.json`:

```json
{
  "mcpServers": {
    "compress-tokens": {
      "command": "/path/to/compress-tokens",
      "args": ["--preset", "smollm", "--idle-timeout", "30"]
    }
  }
}
```
### Use a local GGUF file

```bash
claude mcp add compress-tokens "/path/to/compress-tokens --model /path/to/model.gguf" --scope user
```
Any LLaMA-architecture GGUF model works. `--model` takes priority over `--preset` and env vars.
### Override via environment variables

```json
{
  "mcpServers": {
    "compress-tokens": {
      "command": "/path/to/compress-tokens",
      "env": {
        "COMPRESS_MODEL_REPO": "bartowski/Llama-3.2-3B-Instruct-GGUF",
        "COMPRESS_MODEL_FILE": "Llama-3.2-3B-Instruct-Q4_K_M.gguf"
      }
    }
  }
}
```
`COMPRESS_MODEL_REPO` is validated against a built-in allowlist of trusted HuggingFace repos. Use `--model` with a local path if you need a custom model.

Priority order: `--model` > env vars > `--preset`.
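
That priority translates into a resolution function along these lines (a sketch; the function name and return convention are assumptions, not the server's internals):

```rust
/// Resolve which model to load, mirroring the documented priority:
/// --model > COMPRESS_MODEL_REPO/FILE env vars > --preset.
fn resolve_model(cli_model: Option<String>, preset: Option<String>) -> String {
    if let Some(path) = cli_model {
        return path; // 1. --model: an explicit local GGUF path always wins
    }
    if let (Ok(repo), Ok(file)) = (
        std::env::var("COMPRESS_MODEL_REPO"),
        std::env::var("COMPRESS_MODEL_FILE"),
    ) {
        return format!("{repo}/{file}"); // 2. env vars, checked against the allowlist
    }
    preset.unwrap_or_else(|| "llama1b".to_string()) // 3. --preset, default llama1b
}
```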
## Security flags

| Flag | Description |
|---|---|
| `--no-file-access` | Disable the `compress_file` tool entirely. Use this when you want to restrict the server to in-memory text compression only. |

```bash
claude mcp add compress-tokens "/path/to/compress-tokens --no-file-access" --scope user
```
## Architecture
- Startup loading: model loads at startup; process exits after the idle timeout, so RAM is zero when not in use
- stdout discipline: only valid JSON-RPC 2.0 is ever written to stdout; all logging goes to stderr
- No external APIs: all inference runs locally via candle (CPU or Metal)
- Input limits: text inputs capped at 1 MB; stdin lines capped at 10 MB
- Chunked file compression: files over 2048 tokens are split into chunks, each compressed independently (see the sketch after this list)
- Supply chain integrity: tokenizer is SHA-256 verified at build time and embedded in the binary
- Protocol: MCP 2024-11-05 over stdio
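
The chunked compression mentioned above is easy to picture. Here is a hypothetical version of the loop, where `compress_chunk` stands in for the per-chunk surprisal pass sketched under How it works:

```rust
/// Hypothetical shape of compress_file's chunking: split the token stream
/// into 2048-token windows, compress each independently, and rejoin.
const CHUNK_TOKENS: usize = 2048;

fn compress_all(tokens: &[u32], keep_ratio: f64) -> Vec<u32> {
    tokens
        .chunks(CHUNK_TOKENS)
        .flat_map(|chunk| compress_chunk(chunk, keep_ratio))
        .collect()
}

// Placeholder for the real surprisal-based scoring; see "How it works".
fn compress_chunk(chunk: &[u32], keep_ratio: f64) -> Vec<u32> {
    let n_keep = ((chunk.len() as f64) * keep_ratio).ceil() as usize;
    chunk.iter().copied().take(n_keep).collect()
}
```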
## License
MIT
