token-compressor
Reduce LLM prompt tokens by 30–70% while preserving semantic meaning.
mcp-name: io.github.base76-research-lab/token-compressor
Semantic prompt compression for LLM workflows. Reduce token usage by 40–60% without losing meaning.
Built by Base76 Research Lab – research into epistemic AI architecture.
Live demo
Intent Compiler MVP is now live and uses this project as part of its idea → spec → compressed output flow:
- Live: https://intent-compiler-mvp.pages.dev
- Product repo: https://github.com/base76-research-lab/token-compressor
What it does
token-compressor is a two-stage pipeline that compresses prompts before they reach an LLM:
- LLM compression – a local model (llama3.2:1b via Ollama) rewrites the prompt to its semantic minimum, preserving all conditionals and negations
- Embedding validation – cosine similarity between the original and compressed embeddings must exceed a threshold (default: 0.85); if not, the original is sent unchanged
The result: shorter prompts, lower costs, same intent.
Input prompt (300 tokens)
↓
LLM compresses
↓
Embedding validates (cosine ≥ 0.85?)
↓
Pass → compressed (120 tokens)   Fail → original (300 tokens)
Key design principle: conditionality is never sacrificed. If your prompt says "only do X if Y", that constraint survives compression.
Requirements
- Python 3.10+
- Ollama running locally
- Two models pulled:
ollama pull llama3.2:1b
ollama pull nomic-embed-text
- Python dependencies:
pip install ollama numpy
Quick start
from compressor import LLMCompressEmbedValidate
pipeline = LLMCompressEmbedValidate()
result = pipeline.process("Your prompt text here...")
print(result.output_text) # compressed (or original if validation failed)
print(result.report()) # MODE / COVERAGE / TOKENS saved
Result object:
| Field | Description |
|---|---|
| output_text | Text to send to your LLM |
| mode | compressed / raw_fallback / skipped |
| coverage | Cosine similarity (0.0–1.0) |
| tokens_in | Estimated input tokens |
| tokens_out | Estimated output tokens |
| tokens_saved | tokens_in minus tokens_out |
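The fields compose naturally into a one-line report. A minimal sketch, using a mocked result built with SimpleNamespace purely for illustration (real result objects come from pipeline.process()):

```python
from types import SimpleNamespace

def summarize(result) -> str:
    # Combine the documented fields into a one-line report.
    saved = result.tokens_in - result.tokens_out
    return f"{result.mode}: coverage={result.coverage:.2f}, saved {saved} tokens"

# Mocked result for illustration only; the real pipeline returns
# an object with these same fields.
result = SimpleNamespace(mode="compressed", coverage=0.91,
                         tokens_in=300, tokens_out=120)
print(summarize(result))
```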
CLI usage
echo "Your long prompt here..." | python3 cli.py
Output: compressed text on stdout, stats on stderr.
Claude Code hook (recommended setup)
Add to your ~/.claude/settings.json under hooks → UserPromptSubmit:
{
"type": "command",
"command": "echo \"${CLAUDE_USER_PROMPT:-}\" | python3 /path/to/token-compressor/cli.py > /tmp/compressed_prompt.txt 2>/tmp/compress.log || true"
}
This runs on every prompt submission and writes the compressed version to a temp file, which can be injected back into context via a second hook or MCP server.
MCP server
The MCP server exposes compression as a tool callable from Claude Code and any MCP-compatible client.
Install:
pip install token-compressor-mcp
Tool: compress_prompt
- Input: text (string)
- Output: compressed text + stats footer
Claude Code MCP config (~/.claude/settings.json):
{
"mcpServers": {
"token-compressor": {
"command": "uvx",
"args": ["token-compressor-mcp"]
}
}
}
Or from source:
{
"mcpServers": {
"token-compressor": {
"command": "python3",
"args": ["-m", "token_compressor_mcp"],
"cwd": "/path/to/token-compressor"
}
}
}
Configuration
pipeline = LLMCompressEmbedValidate(
threshold=0.85, # cosine similarity floor (lower = more aggressive)
min_tokens=80, # skip pipeline below this (not worth compressing)
compress_model="llama3.2:1b",
embed_model="nomic-embed-text",
)
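The min_tokens cutoff exists because very short prompts cost more to compress (an extra local-LLM call plus an embedding check) than they save. A sketch of the skip decision, assuming a rough ~4-characters-per-token estimate (the library's actual estimator may differ):

```python
def should_compress(text: str, min_tokens: int = 80) -> bool:
    # Rough heuristic: ~4 characters per token for English text.
    # This estimator is an assumption for illustration, not the
    # library's internal one.
    estimated_tokens = len(text) // 4
    # Below min_tokens the pipeline is skipped entirely ("skipped" mode):
    # compression overhead would outweigh any token savings.
    return estimated_tokens >= min_tokens
```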
How it works
Stage 1 – LLM compression
The compression prompt instructs the model to:
- Preserve all conditionals (if, only if, unless, when, but only)
- Preserve all negations
- Remove filler, hedging, redundancy
- Target 40–60% of original length
Stage 2 – Embedding validation
Computes cosine similarity between the original and compressed text using nomic-embed-text. If similarity falls below the threshold, the original is returned unchanged. This prevents silent meaning loss.
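The two stages can be sketched end to end. Here compress and embed are hypothetical stand-ins for the Ollama calls (llama3.2:1b and nomic-embed-text respectively), not the library's actual API:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compress_with_validation(original, compress, embed, threshold=0.85):
    # `compress` and `embed` are stand-ins for the model calls.
    candidate = compress(original)
    if cosine(embed(original), embed(candidate)) >= threshold:
        return candidate, "compressed"
    # Below threshold: keep the original, preventing silent meaning loss.
    return original, "raw_fallback"
```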
Results
Tested across Swedish and English prompts, technical and natural language:
| Input | Tokens in | Tokens out | Saved |
|---|---|---|---|
| Research abstract (EN) | 89 | 38 | 57% |
| Session intent (SV) | 32 | 18 | 44% |
| Technical instruction | 47 | 22 | 53% |
| Short command (<80 tokens) | – | – | skipped |
Research background
This tool implements the architecture from:
WikstrΓΆm, B. (2026). When Alignment Reduces Uncertainty: Epistemic Variance Collapse and Its Implications for Metacognitive AI. DOI: 10.5281/zenodo.18731535
Part of the Base76 Research Lab toolchain for epistemic AI infrastructure.
License
MIT – Base76 Research Lab, Sweden
