llm-benchmark-mcp-server
MCP Server for LLM comparison, benchmarks, and pricing: find the best model for any task
LLM Benchmark MCP Server
MCP server that gives AI agents access to LLM benchmark data, pricing comparisons, and model recommendations.
Features
- compare_models: Side-by-side benchmark comparison of LLMs (MMLU, HumanEval, MATH, GPQA, ARC, HellaSwag)
- get_model_details: Detailed info about a specific model, including strengths and weaknesses
- recommend_model: Get the best model recommendation for your task and budget
- list_top_models: Top models ranked by category (coding, math, reasoning, chat)
- get_pricing: Pricing comparison via the OpenRouter API
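MCP clients invoke tools like these through JSON-RPC 2.0 `tools/call` requests. The sketch below shows what such a request could look like for compare_models. The `build_tool_call` helper and the `models` argument name are assumptions made for illustration; the server's actual input schema is not documented here, so check it with a `tools/list` request before relying on any argument names.

```python
import json
from typing import Any, Dict


def build_tool_call(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# The "models" argument name is hypothetical; the real schema may differ.
request = build_tool_call(
    1, "compare_models", {"models": ["GPT-4o", "Claude 3.5 Sonnet"]}
)
print(request)
```

In practice an MCP client such as Claude Desktop builds and sends these messages for you; the sketch only illustrates the wire format.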
Supported Models
GPT-4o, GPT-4o-mini, GPT-4 Turbo, o1, o3-mini, Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus, Gemini 2.0 Flash, Gemini 2.0 Pro, Gemini 1.5 Pro, Llama 3.1 (8B/70B/405B), Llama 3.3 70B, Mistral Large, Mistral Small, Mixtral 8x22B, DeepSeek V3, DeepSeek R1, Qwen 2.5 72B
Installation
pip install llm-benchmark-mcp-server
Usage with Claude Desktop
Add to your claude_desktop_config.json:
{
  "mcpServers": {
    "llm-benchmark": {
      "command": "benchmark-server"
    }
  }
}
Or via uvx (no install needed):
{
  "mcpServers": {
    "llm-benchmark": {
      "command": "uvx",
      "args": ["llm-benchmark-mcp-server"]
    }
  }
}
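If claude_desktop_config.json already lists other servers, the new entry has to be merged in rather than pasted over the file. The following is a minimal sketch of that merge using only the standard library. The helper names `add_server` and `update_config_file` are made up for this example, the `other` server is a placeholder, and the config file location depends on your operating system, so the path is left as a parameter.

```python
from __future__ import annotations

import json
from pathlib import Path


def add_server(config: dict, name: str, command: str,
               args: list[str] | None = None) -> dict:
    """Return a copy of `config` with an MCP server entry added or replaced."""
    updated = dict(config)
    servers = dict(updated.get("mcpServers", {}))
    entry: dict = {"command": command}
    if args:
        entry["args"] = list(args)
    servers[name] = entry
    updated["mcpServers"] = servers
    return updated


def update_config_file(path: Path, name: str, command: str,
                       args: list[str] | None = None) -> None:
    """Read the JSON config at `path` (if it exists), add the server, write it back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(add_server(config, name, command, args), indent=2))


# Placeholder existing config; "other" is a hypothetical server entry.
existing = {"mcpServers": {"other": {"command": "other-server"}}}
merged = add_server(existing, "llm-benchmark", "uvx", ["llm-benchmark-mcp-server"])
print(json.dumps(merged, indent=2))
```

Restart Claude Desktop after changing the file so the new server is picked up.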
Example Queries
- "Compare GPT-4o vs Claude 3.5 Sonnet vs Gemini 2.0 Pro"
- "Which model is best for coding on a low budget?"
- "Show me the top 10 models for math"
- "What does GPT-4o cost compared to Claude?"
- "Give me details about DeepSeek R1"
Data Sources
- Benchmarks: Hardcoded from official papers and public leaderboards (MMLU, HumanEval, MATH, GPQA, ARC-Challenge, HellaSwag)
- Pricing: Live data from OpenRouter API
- Arena Rankings: Chatbot Arena Leaderboard (when available)
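To illustrate how live pricing data might be turned into a comparison, here is a rough sketch that converts per-token prices into USD per million tokens. It assumes a payload shaped like the model list OpenRouter's API returns, with a `pricing` object holding `prompt` and `completion` values as per-token USD strings. That shape, the `price_per_million` helper, and every id and price in the sample are assumptions for illustration, not this server's implementation or real prices.

```python
from __future__ import annotations


def price_per_million(models: list[dict]) -> dict[str, tuple[float, float]]:
    """Map model id -> (input, output) price in USD per million tokens."""
    table: dict[str, tuple[float, float]] = {}
    for model in models:
        pricing = model.get("pricing") or {}
        try:
            prompt = round(float(pricing["prompt"]) * 1_000_000, 6)
            completion = round(float(pricing["completion"]) * 1_000_000, 6)
        except (KeyError, TypeError, ValueError):
            continue  # skip entries without usable pricing
        table[model["id"]] = (prompt, completion)
    return table


# Placeholder payload: ids and prices are invented for this example.
sample = [
    {"id": "example/model-a", "pricing": {"prompt": "0.000002", "completion": "0.000008"}},
    {"id": "example/model-b", "pricing": {"prompt": "0.0000005", "completion": "0.0000015"}},
    {"id": "example/no-pricing"},
]

prices = price_per_million(sample)
cheapest_input = min(prices, key=lambda model_id: prices[model_id][0])
for model_id, (inp, out) in sorted(prices.items()):
    print(f"{model_id}: ${inp:.2f} in / ${out:.2f} out per 1M tokens")
```

Normalizing to a per-million-token figure keeps the numbers readable and makes input and output costs directly comparable across models.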
More MCP Servers by AiAgentKarl
| Category | Servers |
|---|---|
| Blockchain | Solana |
| Data | Weather · Germany · Agriculture · Space · Aviation · EU Companies |
| Security | Cybersecurity · Policy Gateway · Audit Trail |
| Agent Infra | Memory · Directory · Hub · Reputation |
| Research | Academic · LLM Benchmark · Legal |
→ Full catalog (40+ servers)
License
MIT
