mcp-coder-bench
A Rust-based benchmarking tool for measuring MCP (Model Context Protocol) server effectiveness in LLM-assisted development. Compare how different MCP servers impact token usage, cost, and task completion when using Claude Code.
Features
- MCP Server Benchmarking: Compare baseline Claude Code performance against MCP-enhanced scenarios
- Parallel Execution: Run multiple benchmark containers simultaneously with configurable concurrency
- Multi-Runtime Support: Works with Docker and Podman (auto-detection)
- Real-time Progress: Live streaming of tool calls and MCP usage during execution
- Statistical Analysis: Confidence intervals, significance testing (Welch's t-test), effect sizes (Cohen's d)
- Multiple Output Formats: JSON, CSV, Markdown (with charts), HTML (with SVG visualizations), terminal tables
- Workspace Isolation: Each run gets a fresh workspace copy to ensure reproducibility
- Structured Logging: Tracing-based logging to file and console
Installation
cargo install --path .
Or build from source:
cargo build --release
Requirements
- Docker or Podman
- Rust 1.75+ (for building)
- ANTHROPIC_API_KEY environment variable
Quick Start
1. Initialize Configuration
# Basic configuration
mcp-coder-bench init > mcp-coder-bench.yaml
# With Boarder MCP server example
mcp-coder-bench init --with-boarder > mcp-coder-bench.yaml
2. Validate Configuration
# Check config, workspace, and Docker connectivity
mcp-coder-bench validate
# Also verify container image exists
mcp-coder-bench validate --check-image
3. Run Benchmarks
# Run with default configuration
mcp-coder-bench run
# Run with custom settings
mcp-coder-bench run -p 4 -n 5 # 4 parallel containers, 5 runs each
# Run a single scenario
mcp-coder-bench run --scenario baseline
# Dry-run to validate without executing
mcp-coder-bench run --dry-run
4. Analyze Results
# Terminal table output
mcp-coder-bench analyze results/20260123/
# With statistical analysis
mcp-coder-bench analyze results/20260123/ --stats
# Export as Markdown with charts
mcp-coder-bench analyze results/20260123/ --format markdown --stats -o report.md
# Export as HTML with SVG visualizations
mcp-coder-bench analyze results/20260123/ --format html -o report.html
# Exclude outliers from analysis
mcp-coder-bench analyze results/20260123/ --exclude-outliers
5. Compare Scenarios
# Basic comparison
mcp-coder-bench compare results/baseline results/with-mcp
# With significance testing
mcp-coder-bench compare results/baseline results/with-mcp --significance
# Multi-scenario comparison
mcp-coder-bench compare results/v1 results/v2 results/v3 --format markdown
Configuration
Example configuration file:
name: "MCP Benchmark"

scenarios:
  - name: "baseline"
    description: "No MCP servers"
    mcp_config: {}
  - name: "with-boarder"
    description: "With Boarder MCP server"
    mcp_config:
      mcpServers:
        boarder:
          command: "/usr/local/bin/boarder"
          args: ["mcp"]

task:
  prompt_file: "prompts/task.md"
  workspace: "test-repo/"
  timeout_seconds: 600
  reset_strategy: "copy"  # copy, git, or none

execution:
  runs_per_scenario: 3
  parallelism: 2
  container_runtime: "auto"

container:
  image: "mcp-coder-bench:latest"  # Optional custom image
  builder: "docker"                # Optional: docker, podman, or auto

output:
  directory: "results/"
  formats: ["json", "markdown"]
Workspace Reset Strategies
- copy: Creates isolated workspace copies for each run (recommended for reproducibility)
- git: Resets workspace using `git checkout` and `git clean`
- none: Workspace persists between runs (useful for incremental tasks)
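For instance, a scenario whose workspace is a git checkout can switch the task section of the configuration to the git strategy; the paths below are illustrative:

```yaml
task:
  prompt_file: "prompts/task.md"
  workspace: "test-repo/"
  timeout_seconds: 600
  reset_strategy: "git"  # revert tracked files and delete untracked ones between runs
```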
Commands
run
Execute benchmark scenarios.
mcp-coder-bench run [OPTIONS]
Options:
-c, --config <FILE> Configuration file [default: mcp-coder-bench.yaml]
-p, --parallelism <N> Number of parallel containers [default: 1]
-n, --runs <N> Number of runs per scenario
-s, --scenario <NAME> Run only a specific scenario
-o, --output <DIR> Output directory for results
--runtime <RUNTIME> Container runtime (auto, docker, podman) [default: auto]
--rebuild Rebuild container image
--dry-run Validate without running
-v, --verbose Verbose output
analyze
Analyze results from a benchmark run.
mcp-coder-bench analyze <RESULTS_DIR> [OPTIONS]
Options:
-f, --format <FORMAT> Output format (json, csv, markdown, html, table) [default: table]
-o, --output <FILE> Output file (stdout if not specified)
--stats Include statistical analysis
--exclude-outliers Exclude outlier runs from analysis
--include-raw Include raw output in results
compare
Compare benchmark result sets.
mcp-coder-bench compare <DIR>... [OPTIONS]
Options:
-f, --format <FORMAT> Output format (json, csv, markdown, html, table) [default: table]
-o, --output <FILE> Output file (stdout if not specified)
--significance Include statistical significance testing
--confidence <LEVEL> Confidence level for tests [default: 0.95]
validate
Validate configuration without running benchmarks.
mcp-coder-bench validate [OPTIONS]
Options:
-c, --config <FILE> Configuration file [default: mcp-coder-bench.yaml]
--runtime <RUNTIME> Container runtime [default: auto]
--check-image Also verify container image exists
-v, --verbose Verbose output
init
Generate a sample configuration file.
mcp-coder-bench init [OPTIONS]
Options:
-o, --output <FILE> Output file (stdout if not specified)
--with-boarder Include example Boarder MCP configuration
Metrics Collected
- Token Usage: Input, output, cache creation, and cache read tokens
- Cost: Estimated USD cost based on Claude pricing
- Tool Calls: All tools used, with MCP tools tracked separately
- Wall Time: Actual execution duration
- Success Rate: Task completion status
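Cost is derived from the four token counters above. A minimal sketch of that calculation follows; the struct, function name, and per-million-token rates are all placeholders for illustration, not the tool's actual API or current Claude pricing:

```rust
/// Token counters collected per run (illustrative struct, not the crate's API).
struct TokenUsage {
    input: u64,
    output: u64,
    cache_creation: u64,
    cache_read: u64,
}

/// Estimate USD cost from token counts.
/// The rates below are placeholder $/1M-token values; substitute the
/// current pricing for the model you benchmark against.
fn estimated_cost_usd(u: &TokenUsage) -> f64 {
    const INPUT: f64 = 3.00;
    const OUTPUT: f64 = 15.00;
    const CACHE_WRITE: f64 = 3.75;
    const CACHE_READ: f64 = 0.30;
    (u.input as f64 * INPUT
        + u.output as f64 * OUTPUT
        + u.cache_creation as f64 * CACHE_WRITE
        + u.cache_read as f64 * CACHE_READ)
        / 1_000_000.0
}

fn main() {
    let usage = TokenUsage {
        input: 100_000,
        output: 10_000,
        cache_creation: 0,
        cache_read: 0,
    };
    println!("estimated cost: ${:.2}", estimated_cost_usd(&usage));
}
```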
Statistical Features
- Confidence Intervals: 95% CI for token usage and cost (t-distribution for small samples)
- Significance Testing: Welch's t-test for comparing scenarios
- Effect Sizes: Cohen's d with labels (negligible/small/medium/large)
- Outlier Detection: IQR-based outlier identification and filtering
- Distribution Analysis: Histograms and percentiles
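The two comparison statistics can be sketched in a few lines; the formulas are standard, while the function names and sample numbers below are illustrative (they mimic per-run token counts, in thousands), not the crate's actual API or real benchmark output:

```rust
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Sample variance (Bessel-corrected, n - 1 denominator).
fn variance(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0)
}

/// Welch's t statistic: compares two means without assuming equal variances,
/// which suits scenarios with different run-to-run variability.
fn welch_t(a: &[f64], b: &[f64]) -> f64 {
    (mean(a) - mean(b))
        / (variance(a) / a.len() as f64 + variance(b) / b.len() as f64).sqrt()
}

/// Cohen's d: standardized effect size using the pooled standard deviation.
fn cohens_d(a: &[f64], b: &[f64]) -> f64 {
    let (na, nb) = (a.len() as f64, b.len() as f64);
    let pooled = (((na - 1.0) * variance(a) + (nb - 1.0) * variance(b))
        / (na + nb - 2.0))
        .sqrt();
    (mean(a) - mean(b)) / pooled
}

fn main() {
    // Illustrative per-run token counts (K) for two scenarios.
    let baseline = [125.3, 131.0, 119.8, 128.4, 122.0];
    let with_mcp = [98.7, 104.2, 95.1, 101.9, 97.3];
    println!("Welch t = {:.2}", welch_t(&baseline, &with_mcp));
    println!("Cohen's d = {:.2}", cohens_d(&baseline, &with_mcp));
}
```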
Output Examples
Terminal Table
╭──────────────┬──────┬────────────┬──────────┬──────────┬─────────╮
│ Scenario     │ Runs │ Avg Tokens │ Avg Cost │ Avg Time │ Success │
├──────────────┼──────┼────────────┼──────────┼──────────┼─────────┤
│ baseline     │    5 │     125.3K │    $1.45 │   342.1s │    100% │
│ with-boarder │    5 │      98.7K │    $1.12 │   287.3s │    100% │
╰──────────────┴──────┴────────────┴──────────┴──────────┴─────────╯
MCP Tool Detection
MCP Tool Usage:
mcp__boarder__cut_from_source - 2 calls
mcp__boarder__clear_buffer - 1 call
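MCP tools are distinguishable from built-in tools because Claude Code names them with the `mcp__<server>__<tool>` convention, as in the output above. A minimal classifier along those lines (the function name is illustrative, not the crate's API):

```rust
/// Split an MCP-style tool name into (server, tool).
/// Returns None for built-in tools, which lack the `mcp__` prefix.
fn parse_mcp_tool(name: &str) -> Option<(&str, &str)> {
    let rest = name.strip_prefix("mcp__")?;
    rest.split_once("__")
}

fn main() {
    for name in ["mcp__boarder__cut_from_source", "Bash"] {
        match parse_mcp_tool(name) {
            Some((server, tool)) => println!("{name}: MCP server={server}, tool={tool}"),
            None => println!("{name}: built-in tool"),
        }
    }
}
```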
Logging
Logs are written to ~/.mcp-coder-bench/logs/ with rotation. Set RUST_LOG for console output:
RUST_LOG=debug mcp-coder-bench run
License
MIT License - see LICENSE for details.
Contributing
Contributions welcome! Please ensure:
- All tests pass (`cargo test`)
- Code is formatted (`cargo fmt`)
- No clippy warnings (`cargo clippy`)
