mcp-coder-bench
A Rust-based benchmarking tool for measuring MCP (Model Context Protocol) server effectiveness in LLM-assisted development. Compare how different MCP servers impact token usage, cost, and task completion when using Claude Code.
Features
- MCP Server Benchmarking: Compare baseline Claude Code performance against MCP-enhanced scenarios
- Parallel Execution: Run multiple benchmark containers simultaneously with configurable concurrency
- Multi-Runtime Support: Works with Docker and Podman (auto-detection)
- Real-time Progress: Live streaming of tool calls and MCP usage during execution
- Statistical Analysis: Confidence intervals, significance testing (Welch's t-test), effect sizes (Cohen's d)
- Multiple Output Formats: JSON, CSV, Markdown (with charts), HTML (with SVG visualizations), terminal tables
- Workspace Isolation: Each run gets a fresh workspace copy to ensure reproducibility
- Structured Logging: Tracing-based logging to file and console
Installation
cargo install --path .
Or build from source:
cargo build --release
Requirements
- Docker or Podman
- Rust 1.75+ (for building)
- ANTHROPIC_API_KEY environment variable
Quick Start
1. Initialize Configuration
# Basic configuration
mcp-coder-bench init > mcp-coder-bench.yaml
# With Boarder MCP server example
mcp-coder-bench init --with-boarder > mcp-coder-bench.yaml
2. Validate Configuration
# Check config, workspace, and Docker connectivity
mcp-coder-bench validate
# Also verify container image exists
mcp-coder-bench validate --check-image
3. Run Benchmarks
# Run with default configuration
mcp-coder-bench run
# Run with custom settings
mcp-coder-bench run -p 4 -n 5 # 4 parallel containers, 5 runs each
# Run a single scenario
mcp-coder-bench run --scenario baseline
# Dry-run to validate without executing
mcp-coder-bench run --dry-run
4. Analyze Results
# Terminal table output
mcp-coder-bench analyze results/20260123/
# With statistical analysis
mcp-coder-bench analyze results/20260123/ --stats
# Export as Markdown with charts
mcp-coder-bench analyze results/20260123/ --format markdown --stats -o report.md
# Export as HTML with SVG visualizations
mcp-coder-bench analyze results/20260123/ --format html -o report.html
# Exclude outliers from analysis
mcp-coder-bench analyze results/20260123/ --exclude-outliers
5. Compare Scenarios
# Basic comparison
mcp-coder-bench compare results/baseline results/with-mcp
# With significance testing
mcp-coder-bench compare results/baseline results/with-mcp --significance
# Multi-scenario comparison
mcp-coder-bench compare results/v1 results/v2 results/v3 --format markdown
Configuration
Example configuration file:
name: "MCP Benchmark"

scenarios:
  - name: "baseline"
    description: "No MCP servers"
    mcp_config: {}
  - name: "with-boarder"
    description: "With Boarder MCP server"
    mcp_config:
      mcpServers:
        boarder:
          command: "/usr/local/bin/boarder"
          args: ["mcp"]

task:
  prompt_file: "prompts/task.md"
  workspace: "test-repo/"
  timeout_seconds: 600
  reset_strategy: "copy"  # copy, git, or none

execution:
  runs_per_scenario: 3
  parallelism: 2
  container_runtime: "auto"

container:
  image: "mcp-coder-bench:latest"  # Optional custom image
  builder: "docker"                # Optional: docker, podman, or auto

output:
  directory: "results/"
  formats: ["json", "markdown"]
Workspace Reset Strategies
- copy: Creates isolated workspace copies for each run (recommended for reproducibility)
- git: Resets workspace using `git checkout` and `git clean`
- none: Workspace persists between runs (useful for incremental tasks)
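For instance, a scenario whose workspace is a git checkout can switch the task section of the configuration to the git strategy; the paths below are illustrative:

```yaml
task:
  prompt_file: "prompts/task.md"
  workspace: "test-repo/"
  timeout_seconds: 600
  reset_strategy: "git"  # revert tracked files and delete untracked ones between runs
```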
Commands
run
Execute benchmark scenarios.
mcp-coder-bench run [OPTIONS]
Options:
-c, --config <FILE> Configuration file [default: mcp-coder-bench.yaml]
-p, --parallelism <N> Number of parallel containers [default: 1]
-n, --runs <N> Number of runs per scenario
-s, --scenario <NAME> Run only a specific scenario
-o, --output <DIR> Output directory for results
--runtime <RUNTIME> Container runtime (auto, docker, podman) [default: auto]
--rebuild Rebuild container image
--dry-run Validate without running
-v, --verbose Verbose output
analyze
Analyze results from a benchmark run.
mcp-coder-bench analyze <RESULTS_DIR> [OPTIONS]
Options:
-f, --format <FORMAT> Output format (json, csv, markdown, html, table) [default: table]
-o, --output <FILE> Output file (stdout if not specified)
--stats Include statistical analysis
--exclude-outliers Exclude outlier runs from analysis
--include-raw Include raw output in results
compare
Compare benchmark result sets.
mcp-coder-bench compare <DIR>... [OPTIONS]
Options:
-f, --format <FORMAT> Output format (json, csv, markdown, html, table) [default: table]
-o, --output <FILE> Output file (stdout if not specified)
--significance Include statistical significance testing
--confidence <LEVEL> Confidence level for tests [default: 0.95]
validate
Validate configuration without running benchmarks.
mcp-coder-bench validate [OPTIONS]
Options:
-c, --config <FILE> Configuration file [default: mcp-coder-bench.yaml]
--runtime <RUNTIME> Container runtime [default: auto]
--check-image Also verify container image exists
-v, --verbose Verbose output
init
Generate a sample configuration file.
mcp-coder-bench init [OPTIONS]
Options:
-o, --output <FILE> Output file (stdout if not specified)
--with-boarder Include example Boarder MCP configuration
Metrics Collected
- Token Usage: Input, output, cache creation, and cache read tokens
- Cost: Estimated USD cost based on Claude pricing
- Tool Calls: All tools used, with MCP tools tracked separately
- Wall Time: Actual execution duration
- Success Rate: Task completion status
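Cost is derived from the four token counters above. A minimal sketch of that calculation follows; the struct, function name, and per-million-token rates are all placeholders for illustration, not the tool's actual API or current Claude pricing:

```rust
/// Token counters collected per run (illustrative struct, not the crate's API).
struct TokenUsage {
    input: u64,
    output: u64,
    cache_creation: u64,
    cache_read: u64,
}

/// Estimate USD cost from token counts.
/// The rates below are placeholder $/1M-token values; substitute the
/// current pricing for the model you benchmark against.
fn estimated_cost_usd(u: &TokenUsage) -> f64 {
    const INPUT: f64 = 3.00;
    const OUTPUT: f64 = 15.00;
    const CACHE_WRITE: f64 = 3.75;
    const CACHE_READ: f64 = 0.30;
    (u.input as f64 * INPUT
        + u.output as f64 * OUTPUT
        + u.cache_creation as f64 * CACHE_WRITE
        + u.cache_read as f64 * CACHE_READ)
        / 1_000_000.0
}

fn main() {
    let usage = TokenUsage {
        input: 100_000,
        output: 10_000,
        cache_creation: 0,
        cache_read: 0,
    };
    println!("estimated cost: ${:.2}", estimated_cost_usd(&usage));
}
```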
Statistical Features
- Confidence Intervals: 95% CI for token usage and cost (t-distribution for small samples)
- Significance Testing: Welch's t-test for comparing scenarios
- Effect Sizes: Cohen's d with labels (negligible/small/medium/large)
- Outlier Detection: IQR-based outlier identification and filtering
- Distribution Analysis: Histograms and percentiles
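The two comparison statistics can be sketched in a few lines; the formulas are standard, while the function names and sample numbers below are illustrative (they mimic per-run token counts, in thousands), not the crate's actual API or real benchmark output:

```rust
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Sample variance (Bessel-corrected, n - 1 denominator).
fn variance(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0)
}

/// Welch's t statistic: compares two means without assuming equal variances,
/// which suits scenarios with different run-to-run variability.
fn welch_t(a: &[f64], b: &[f64]) -> f64 {
    (mean(a) - mean(b))
        / (variance(a) / a.len() as f64 + variance(b) / b.len() as f64).sqrt()
}

/// Cohen's d: standardized effect size using the pooled standard deviation.
fn cohens_d(a: &[f64], b: &[f64]) -> f64 {
    let (na, nb) = (a.len() as f64, b.len() as f64);
    let pooled = (((na - 1.0) * variance(a) + (nb - 1.0) * variance(b))
        / (na + nb - 2.0))
        .sqrt();
    (mean(a) - mean(b)) / pooled
}

fn main() {
    // Illustrative per-run token counts (K) for two scenarios.
    let baseline = [125.3, 131.0, 119.8, 128.4, 122.0];
    let with_mcp = [98.7, 104.2, 95.1, 101.9, 97.3];
    println!("Welch t = {:.2}", welch_t(&baseline, &with_mcp));
    println!("Cohen's d = {:.2}", cohens_d(&baseline, &with_mcp));
}
```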
Output Examples
Terminal Table
╭──────────────┬──────┬────────────┬──────────┬──────────┬─────────╮
│ Scenario     │ Runs │ Avg Tokens │ Avg Cost │ Avg Time │ Success │
├──────────────┼──────┼────────────┼──────────┼──────────┼─────────┤
│ baseline     │    5 │     125.3K │    $1.45 │   342.1s │    100% │
│ with-boarder │    5 │      98.7K │    $1.12 │   287.3s │    100% │
╰──────────────┴──────┴────────────┴──────────┴──────────┴─────────╯
MCP Tool Detection
MCP Tool Usage:
mcp__boarder__cut_from_source - 2 calls
mcp__boarder__clear_buffer - 1 call
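MCP tools are distinguishable from built-in tools because Claude Code names them with the `mcp__<server>__<tool>` convention, as in the output above. A minimal classifier along those lines (the function name is illustrative, not the crate's API):

```rust
/// Split an MCP-style tool name into (server, tool).
/// Returns None for built-in tools, which lack the `mcp__` prefix.
fn parse_mcp_tool(name: &str) -> Option<(&str, &str)> {
    let rest = name.strip_prefix("mcp__")?;
    rest.split_once("__")
}

fn main() {
    for name in ["mcp__boarder__cut_from_source", "Bash"] {
        match parse_mcp_tool(name) {
            Some((server, tool)) => println!("{name}: MCP server={server}, tool={tool}"),
            None => println!("{name}: built-in tool"),
        }
    }
}
```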
Logging
Logs are written to ~/.mcp-coder-bench/logs/ with rotation. Set RUST_LOG for console output:
RUST_LOG=debug mcp-coder-bench run
License
MIT License - see LICENSE for details.
Contributing
Contributions welcome! Please ensure:
- All tests pass (`cargo test`)
- Code is formatted (`cargo fmt`)
- No clippy warnings (`cargo clippy`)
