Futureshow
"FutureShow: Can AI Predict the Future? Live Real-World Forecasting"
FutureShow: Can AI Predict the Future?
AI Battle Arena: Competing to Predict Real-World Events
| Live Battle Rankings | Real-World Forecasting | Prediction Markets |
Live Demo · Chinese Documentation · Report Bug
Current Championship Leaderboard
Click Here: AI Live Future Forecasting
| Rank | Model | Correct/Total | Accuracy | Human Acc | vs Human | Pred Value |
|---|---|---|---|---|---|---|
| 1 | DeepSeek | 7535/7895 | 95.4% | 97.2% | -1.8% | +0.020 |
| 2 | GPT-5 | 8010/8661 | 92.5% | 96.9% | -4.5% | -0.041 |
| 3 | Gemini | 7717/8837 | 87.3% | 97.3% | -9.9% | -0.216 |
* Each model may generate different numbers of predictions due to varying prediction intervals.
* Human accuracy is calculated using the same prediction points as the corresponding model for fair comparison.
Round 1 Complete: Results above are from events resolved before the end of 2025. Round 2 is now in progress!
Metrics Explanation
| Metric | Description |
|---|---|
| Correct | Number of correct predictions relative to total predictions made on real-world events. |
| Accuracy | Prediction Accuracy: (Correct Predictions / Total Predictions) × 100% |
| Human Acc | Market Consensus Baseline: Accuracy of crowd wisdom at identical prediction points. Human predictions are derived as YES when market probability > 50%, otherwise NO, representing the collective "Wisdom of the Crowd" benchmark |
| vs Human | AI forecasting performance against crowd wisdom |
| Pred Value | Prediction Value (log-return method): Measures the model's value generation beyond market consensus. |
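The accuracy columns can be reproduced from per-prediction records. The following is a minimal sketch, not the project's scoring code; the record fields (`model_says_yes`, `market_prob_yes`, `resolved_yes`) are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredPrediction:
    """One resolved prediction point (field names are illustrative assumptions)."""
    model_says_yes: bool    # the model's YES/NO call
    market_prob_yes: float  # market probability of YES at prediction time
    resolved_yes: bool      # how the event actually resolved


def leaderboard_row(records: list[ScoredPrediction]) -> dict[str, float]:
    """Compute Accuracy, Human Acc, and vs Human, all in percent."""
    total = len(records)
    if total == 0:
        raise ValueError("no resolved predictions to score")
    model_correct = sum(r.model_says_yes == r.resolved_yes for r in records)
    # Human baseline: YES when market probability > 50%, otherwise NO.
    human_correct = sum((r.market_prob_yes > 0.5) == r.resolved_yes for r in records)
    accuracy = 100.0 * model_correct / total
    human_acc = 100.0 * human_correct / total
    return {"accuracy": accuracy, "human_acc": human_acc, "vs_human": accuracy - human_acc}
```

Because both baselines are scored on the same list of records, the comparison uses identical prediction points, as the footnotes under the leaderboard require.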
Prediction Value Formula
If prediction is CORRECT: Value = -log(p)
If prediction is INCORRECT: Value = log(p)
where p = market probability for the predicted outcome at prediction time, and log is the natural logarithm
Interpretation Guide:
| Value Range | Market Prob (p) | Meaning |
|---|---|---|
| +0.1 ~ +0.7 | 50% ~ 90% | Small gain. Model correctly predicted what the market also favored. |
| +0.7 ~ +2.3 | 10% ~ 50% | Moderate gain. Model correctly made a contrarian prediction. |
| +2.3 ~ +6.9 | 0.1% ~ 10% | Exceptional gain. Model correctly predicted a very unlikely outcome. |
| -0.1 ~ -0.7 | 50% ~ 90% | Minor loss. Model followed market consensus but both were wrong. |
| -0.7 ~ -2.3 | 10% ~ 50% | Moderate loss. Model made a contrarian prediction that failed. |
| -2.3 ~ -6.9 | 0.1% ~ 10% | Severe loss. Model predicted a very unlikely outcome and was wrong. |
Theoretical Bounds: Value ranges from -6.9 to +6.9, based on probability clamp [0.001, 0.999]. In practice, most values fall within ±2.3 (p between 10% and 90%).
The displayed Prediction Value is the Average across all predictions. Positive values indicate the model outperforms market consensus; negative values indicate underperformance.
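The formula and clamp above can be written out in a few lines. This is a sketch under the stated definitions, not the project's code; the function names are hypothetical, and the natural logarithm is assumed because it reproduces the ranges in the interpretation table (for example, -ln 0.5 ≈ 0.69).

```python
import math

P_MIN, P_MAX = 0.001, 0.999  # probability clamp stated in the text above


def prediction_value(p: float, correct: bool) -> float:
    """Log-return value of one prediction.

    p is the market probability of the outcome the model predicted,
    taken at prediction time.
    """
    p = min(max(p, P_MIN), P_MAX)  # keep log(p) within about ±6.9
    return -math.log(p) if correct else math.log(p)


def average_prediction_value(scored: list[tuple[float, bool]]) -> float:
    """Mean value over (p, correct) pairs, as displayed on the leaderboard."""
    if not scored:
        raise ValueError("no predictions to average")
    return sum(prediction_value(p, ok) for p, ok in scored) / len(scored)
```

For instance, `prediction_value(0.5, True)` is about +0.69, while `prediction_value(0.1, False)` is about -2.30, which matches the boundaries between the rows of the table.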
Table of Contents
- Our Mission
- What is FutureShow?
- Key Features
- Screenshots
- System Architecture
- Quick Start
- MCP Tools Reference
- Forecasting Pipeline
- Dashboard & API
- Advanced Configuration
- Data Formats & Output
- Development
- Contributing
Our Mission
Can AI Agents Outthink the Wisdom of the Crowd?
The Foundation: Human Collective Intelligence
Prediction markets are among the most effective mechanisms we have for aggregating collective intelligence. When thousands of participants stake real money on future outcomes, their combined judgment distills into well-calibrated probability estimates. This "wisdom of the crowd" has frequently outperformed individual experts across many domains.
Our Approach: Real-World Testing
FutureShow conducts a transparent, ongoing experiment:
- Direct Competition: Frontier AI models vs. market consensus
- Rigorous Methodology: Every prediction timestamped, every outcome independently verified
- Fair Comparison: Identical decision points, identical timeframes
- Zero Bias: No cherry-picking, no hindsight adjustments
Beyond the Leaderboard
This study investigates AI boundaries beyond performance tracking:
- Where AI excels in prediction accuracy
- Where AI systematically fails against human crowds
- Whether machines can generate alpha against aggregated human wisdom
What is FutureShow?
Can AI agents predict the future better than human crowds betting real money?
FutureShow is an open-source AI benchmarking platform that puts this question to the test. We evaluate frontier AI models against prediction markets, where thousands of participants stake real money on future outcomes, creating some of the most accurate probability estimates available.
How It Works
Our system operates as a continuous, real-world experiment:
Market Intelligence
- Monitors live prediction markets on Polymarket
- Tracks events spanning politics, economics, tech, sports, and culture
AI Agent Deployment
- Deploys multiple frontier models (GPT-5, Claude, Gemini, DeepSeek)
- Each agent analyzes identical market conditions independently
Real-Time Research
- Agents gather intelligence via web search, news, Reddit, and Twitter
- No human intervention: pure AI reasoning and research
Transparent Tracking
- Records each model's YES/NO predictions with full reasoning
- Tracks accuracy as real events unfold
- Maintains live performance leaderboard
Why This Matters
- Prediction markets aren't just betting: they're among the most effective mechanisms we have for aggregating collective intelligence. When people risk real money, their combined judgment creates forecasts that frequently outperform individual experts.
- This makes them well-suited AI benchmarks: objective, real-time, and hard to game. No synthetic datasets, no contrived scenarios. Just AI versus the wisdom of crowds, measured transparently.
Key Features
Multi-Model Agent Arena
FutureShow supports any LLM accessible via LiteLLM, including:
| Provider | Models | Configuration |
|---|---|---|
| OpenAI | GPT-4o, GPT-5 | openai/gpt-5 |
| Anthropic | Claude 4.5 Sonnet, Claude Opus | anthropic/claude-sonnet-4.5 |
| Google | Gemini 2.5 Pro, Gemini Ultra | google/gemini-2.5-pro |
| DeepSeek | DeepSeek-V3, DeepSeek-R1 | deepseek/deepseek-chat-v3.1 |
| OpenRouter | 100+ models | openrouter/provider/model |
Each model runs as an independent agent with:
- Dedicated tool access (search, market data, reasoning)
- Isolated position/PnL tracking
- Persistent session state via SQLite
- Configurable max steps, retries, and delays
Real-Time Market Intelligence
Agents have access to comprehensive MCP (Model Context Protocol) tools:
MCP Tool Suite
| Category | Tools |
|---|---|
| Market Data | list_events, list_markets, get_market_info, get_market_prices, get_market_history |
| Web Search | google_web, google_news, exa_semantic |
| Social | reddit, twitter |
| Trading | buy, sell |
| Utilities | math_tool |
Live Leaderboard & Dashboard
- Real-time accuracy tracking across all resolved markets
- Per-model breakdowns with correct/total/abstain counts
- Category-wise performance (Politics, Crypto, Sports, etc.)
- Historical forecast browsing with full reasoning trails
Simulated Trading Engine
FutureShow includes a realistic trading simulation:
- Order book simulation using live Polymarket CLOB data
- Slippage modeling with configurable liquidity impact
- Position tracking with JSONL ledger persistence
- PnL calculation with NAV (Net Asset Value) history
Screenshots
- Forecasts Overview: Main dashboard showing all prediction markets. Each card displays the event title, market probability, and model predictions, with colored icons indicating YES/NO votes.
- Event Detail Page: Deep dive into a specific market with full prediction history, AI reasoning trails, probability charts, and final outcomes for closed events.
- Model Leaderboard: Competitive rankings showing accuracy, human baseline comparison, and Prediction Value, which measures how much alpha each model generates vs. market consensus.
- Batch Prediction in Action: Watch multiple AI agents analyze markets in parallel with real-time logging, concurrent execution, and automatic result persistence.
System Architecture
Quick Start
1. Environment Setup
# Clone the repository
git clone https://github.com/HKUDS/FutureShow.git
cd FutureShow
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install with dev dependencies
pip install -e ".[dev]"  # quoted so shells such as zsh do not expand the brackets
2. API Key Configuration
Copy the example environment file and fill in your API keys:
cp .env.example .env
Edit .env with your credentials:
# ---------------------------------------------------------------
# LLM Provider API Keys (configure at least one)
# ---------------------------------------------------------------
DEEPSEEK_API_KEY="sk-xxx" # DeepSeek models
DEEPSEEK_BASE_URL="https://api.deepseek.com/v1"
OPENROUTER_API_BASE="https://openrouter.ai/api/v1"
OPENROUTER_API_KEY="sk-or-xxx" # Access 100+ models via OpenRouter
OPENAI_API_BASE="https://api.openai.com/v1" # Or custom endpoint
OPENAI_API_KEY="sk-xxx" # OpenAI GPT models
# Optional: Additional LLM providers
PRIVATE_API_BASE="" # Custom LLM endpoint
PRIVATE_API_KEY=""
LITE_API_BASE="" # LiteLLM proxy endpoint
LITE_API_KEY=""
# ---------------------------------------------------------------
# Search & Intelligence Tools
# ---------------------------------------------------------------
SERPER_API_KEY="xxx" # Google Search via Serper.dev
EXA_API_KEY="xxx" # Exa semantic search
RAPIDAPI_KEY="xxx" # RapidAPI for additional services
# ---------------------------------------------------------------
# Polymarket (optional, for trading mode)
# See "How to Get Polymarket Credentials" below
# ---------------------------------------------------------------
POLYMARKET_API_KEY="" # API key from Polymarket
PRIVATE_KEY="" # Your wallet private key
KEY="" # Same as PRIVATE_KEY
# ---------------------------------------------------------------
# Agent Configuration
# ---------------------------------------------------------------
AGENT_MAX_STEP=30 # Max reasoning steps per agent
RUNTIME_ENV_PATH=".runtime_env.json" # Runtime state file
DEBUG=1 # Debug mode (1=enabled, 0=disabled)
How to Get Polymarket Credentials (for Trading Mode)
Note: These credentials are only required for Live Trading Mode. The forecasting benchmark works without them.
Step 1: Get Your Wallet Private Key (PRIVATE_KEY & KEY)
- Create an Ethereum-compatible wallet (e.g., MetaMask)
- Fund it with MATIC on Polygon network for transaction fees
- Export your private key:
  - MetaMask: Settings → Security & Privacy → Reveal Secret Recovery Phrase (or export the private key for a specific account)
  - Warning: Never share your private key with anyone!
- Set both PRIVATE_KEY and KEY to the same value (your wallet private key)
Step 2: Generate Polymarket API Key (POLYMARKET_API_KEY)
Use the provided script to generate your API credentials:
# Make sure PRIVATE_KEY is set in your .env file first
python futureshow/utils/generate_poly_apikey.py
This script uses py-clob-client to call create_or_derive_api_creds(), which derives your API key from your wallet signature.
Alternatively, generate via Polymarket UI:
- Go to Polymarket and connect your wallet
- Navigate to Settings → API
- Enable API trading and generate credentials
Resources
3. Run Forecasting Benchmark
Start the AI forecasting agents to predict Polymarket events:
# --- Single Round ---
# Run all enabled models once on current watchlist
python run_forecast_loop.py --once
# --- Continuous Loop ---
# Run predictions every 6 hours (default), refresh watchlist each round
python run_forecast_loop.py --refresh --interval 21600
# --- Custom Configuration ---
# Limit to 4 models, target specific month's events
python run_forecast_loop.py \
--limit 4 \
--month 1 \
--year 2025 \
--refresh
4. Track Results & Launch Dashboard
# Start event tracker (monitors market status & prices every 30 min)
python run_forecast_trackers.py --interval 1800 &
# Launch the forecasting dashboard
python web_server_pred.py
# Open http://localhost:10086
The dashboard displays:
- Forecasts page: All active/closed predictions with model votes
- Detail page: Full prediction history and AI reasoning for each event
- Leaderboard: Model accuracy rankings vs human baseline
Optional: Live Trading Mode
Enable simulated trading with PnL tracking
For advanced users who want to run live trading simulations.
Prerequisites: Configure POLYMARKET_API_KEY, PRIVATE_KEY, and KEY in your .env file.
# --- Run Trading Agents ---
# Single round with trading enabled
python main.py configs/default_config.json
# Continuous trading loop (every 40 minutes)
python run_agents_loop.py \
--interval 2400 \
--overrun-pause 900 \
--config configs/default_config.json
# --- Track PnL & Launch Trading Dashboard ---
# Start PnL tracking (updates every 10 seconds)
python run_pnl_trackers.py --interval 10 --config configs/default_config.json &
# Launch trading dashboard
python web_server.py
# Open http://localhost:10032
MCP Tools Reference
FutureShow provides agents with these Model Context Protocol tools:
Polymarket Data Tools
| Tool | Function | Parameters | Returns |
|---|---|---|---|
| list_events | List active events with category balancing | query, tags_any, tags_all, exclude_tags, categories, limit, per_category, detailed | Formatted event list with probability, volume, category |
| list_markets | List markets with filters | query, tags_any, only_open, only_active, sort, trending_only, min_liquidity, limit | Market objects with prices |
| get_polymarket_info_by_slug | Get market/event details | slug | Full market or event object with outcomes, prices |
| get_market_prices | Get current prices | market_slug | {outcome: price} mapping |
| get_market_history | Get price history | market_slug, interval | Historical price series per outcome |
Example: list_events output
01. trump-2028 | p=0.234 | vol=1523000.0 | OI=892341 | cat=US Politics | Will Trump run in 2028?
tags: Politics, Elections, Trump
time: end=2028-11-15T00:00:00Z | updated=2025-01-20T12:00:00Z
liq: 45000 | comments=234
market0: slug=trump-2028-yes | outcomes=['Yes', 'No'] | prices=[0.234, 0.766] | mid=0.234
02. btc-100k-jan | p=0.891 | vol=982000.0 | OI=456123 | cat=Crypto | Bitcoin above $100k by Jan 31?
...
Search Tools
| Tool | Source | Parameters | Returns |
|---|---|---|---|
| google_web_search | Google via Serper | query, num_results, location, hl, gl | Formatted results with Knowledge Graph, Answer Box, organic results |
| google_news_search | Google News via Serper | query, num_results, hl, gl | News articles with title, snippet, source, date |
| google_url2text | Jina AI | url | Extracted article text |
| reddit_search | Reddit API | query, subreddit, sort, limit | Post titles, scores, comments |
| reddit_post_details | Reddit API | post_id | Full post with top comments |
| search_tweets | Twitter/X API | query, max_results | Recent tweets with engagement |
Trading Simulation Tools
| Tool | Action | Parameters | Effect |
|---|---|---|---|
| buy | Purchase shares | market_slug, outcome, cost_usd | Deduct cash, add shares, simulate slippage |
| sell | Sell shares | market_slug, outcome, shares | Add cash, remove shares, simulate slippage |
| settle | Settle closed market | market_slug | Pay out winning positions at $1/share |
Trading simulation features
- Order Book Simulation: Fetches real CLOB data from Polymarket
- Slippage Modeling: Consumes liquidity levels based on order size
- Liquidity Overlay: Tracks consumed liquidity with decay over time
- Partial Fills: Handles insufficient liquidity gracefully
- JSONL Ledger: All trades recorded with full execution details
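To make the slippage and partial-fill behavior concrete, here is a minimal sketch of walking an order book for a buy sized in dollars. It is not FutureShow's implementation: the `simulate_buy` function, the `Fill` record, and the `(price, shares)` input shape are assumptions for illustration, and the liquidity overlay with decay is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Fill:
    """Result of one simulated buy (fields mirror the ledger example loosely)."""
    spent: float
    shares: float
    avg_price: float
    partial_fill: bool
    levels: list = field(default_factory=list)


def simulate_buy(asks: list[tuple[float, float]], cost_usd: float) -> Fill:
    """Spend up to cost_usd by consuming ask levels from cheapest upward.

    asks holds (price, shares_available) pairs; this shape is an assumption.
    """
    remaining = cost_usd
    levels = []
    for price, available in sorted(asks):
        if remaining <= 1e-9:
            break  # budget exhausted
        if price <= 0 or available <= 0:
            continue  # skip empty or malformed levels
        shares = min(available, remaining / price)
        cost = shares * price
        levels.append({"price": price, "shares": shares, "cost": cost})
        remaining -= cost
    spent = cost_usd - remaining
    total_shares = sum(level["shares"] for level in levels)
    avg_price = spent / total_shares if total_shares else 0.0
    # A partial fill means the book ran out before the budget did.
    return Fill(spent, total_shares, avg_price, remaining > 1e-9, levels)
```

Slippage shows up as the gap between `avg_price` and the best ask: the larger the order relative to the liquidity at the top level, the more of it fills at worse prices.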
Utility Tools
| Tool | Function | Parameters |
|---|---|---|
| math_tool | Evaluate mathematical expressions | expression |
Forecasting Pipeline
Agent Workflow
Prediction Format
Agents output predictions in a structured format:
<PREDICTION>market-slug|YES</PREDICTION>
Or for binary markets without explicit slug:
<PREDICTION>YES</PREDICTION>
Supported values: YES, NO, ABSTAIN
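Extracting these tags from an agent's free-text answer can be sketched with a regular expression. The `parse_predictions` helper below is hypothetical, not the project's parser; it accepts both forms shown above and, as an assumption, tolerates surrounding whitespace and lowercase values.

```python
import re
from typing import Optional

# Matches <PREDICTION>slug|VALUE</PREDICTION> and <PREDICTION>VALUE</PREDICTION>.
PREDICTION_RE = re.compile(
    r"<PREDICTION>\s*(?:(?P<slug>[^|<>\s]+)\s*\|\s*)?"
    r"(?P<outcome>YES|NO|ABSTAIN)\s*</PREDICTION>",
    re.IGNORECASE,
)


def parse_predictions(text: str, default_slug: Optional[str] = None) -> list[dict]:
    """Return one {"slug", "outcome"} dict per prediction tag found in text.

    default_slug fills in the slug for the short form that omits it.
    """
    return [
        {
            "slug": match.group("slug") or default_slug,
            "outcome": match.group("outcome").upper(),
        }
        for match in PREDICTION_RE.finditer(text)
    ]
```

Text without a well-formed tag yields an empty list, which a caller could treat as an abstention or as a signal to retry.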
Dashboard & API
Web Server
python web_server.py
# Serves on http://0.0.0.0:10032 by default
Environment variables:
- WEB_HOST: Bind address (default: 0.0.0.0)
- WEB_PORT: Port number (default: 10032)
REST API Endpoints
| Endpoint | Method | Description | Parameters |
|---|---|---|---|
| /api/status | GET | System status, available models | signature |
| /api/models | GET | List all model signatures | - |
| /api/positions | GET | Latest positions & trades | signature |
| /api/pnl | GET | PnL history for date | signature, date, full |
| /api/messages | GET | Agent reasoning logs | signature |
| /api/polymarket_info | GET | Proxy to Polymarket data | slug |
Example API Response: /api/pnl
{
"ok": true,
"signature": "gpt-5",
"date": "2025-01-20",
"times": ["2025-01-20T00:00:00Z", "2025-01-20T01:00:00Z", ...],
"nav": [10000.0, 10023.45, 10089.12, ...],
"returns": [0.0, 0.23, 0.89, ...],
"latest": {
"timestamp": "2025-01-20T23:59:00Z",
"nav": 10234.56,
"cash": 5234.56,
"positions_value": 5000.0
},
"count": 24,
"full": false
}
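A client for this endpoint might look like the sketch below, which uses only the Python standard library. The helper names are hypothetical, and encoding `full` as the strings `true`/`false` is an assumption; check the server code for the format it actually expects.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def pnl_url(base_url: str, signature: str, date: str, full: bool = False) -> str:
    """Build the /api/pnl URL from the documented query parameters."""
    query = urlencode({"signature": signature, "date": date, "full": str(full).lower()})
    return f"{base_url.rstrip('/')}/api/pnl?{query}"


def fetch_pnl(base_url: str, signature: str, date: str, full: bool = False,
              timeout: float = 10.0) -> dict:
    """GET the PnL history and fail loudly if the payload is not ok."""
    with urlopen(pnl_url(base_url, signature, date, full), timeout=timeout) as resp:
        payload = json.load(resp)
    if not payload.get("ok"):
        raise RuntimeError(f"/api/pnl returned an error payload: {payload}")
    return payload


def day_change_pct(payload: dict) -> float:
    """Percent change from the first to the last NAV sample in the payload."""
    nav = payload["nav"]
    return 100.0 * (nav[-1] - nav[0]) / nav[0]
```

With the trading dashboard running locally, `fetch_pnl("http://localhost:10032", "gpt-5", "2025-01-20")` would request the day shown in the example response.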
Advanced Configuration
Config File Structure
{
"agent_type": "PolymarketAgent",
"date_range": {
"init_date": "2025-01-01",
"end_date": "2025-12-31"
},
"agent_config": {
"max_steps": 50, // Max tool calls per event
"max_retries": 3, // Retry on transient failures
"base_delay": 0.5, // Retry backoff base (seconds)
"initial_cash": 10000.0 // Starting cash for simulation
},
"log_config": {
"log_path": "./data/agent_data"
},
"models": [
{
"name": "gpt-5",
"basemodel": "openai/gpt-5",
"signature": "gpt-5",
"enabled": true,
"provider": "openai"
},
{
"name": "claude-4.5-sonnet",
"basemodel": "openrouter/anthropic/claude-sonnet-4.5",
"signature": "claude-4.5-sonnet",
"enabled": true,
"provider": "openrouter"
},
{
"name": "gemini-2.5-pro",
"basemodel": "openrouter/google/gemini-2.5-pro",
"signature": "gemini-2.5-pro",
"enabled": true,
"provider": "openrouter"
},
{
"name": "deepseek-v3.1",
"basemodel": "openrouter/deepseek/deepseek-chat-v3.1",
"signature": "deepseek-v3.1",
"enabled": true,
"provider": "openrouter"
}
]
}
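Reading this file from Python can be sketched as follows. The `enabled_models` helper is hypothetical, and it assumes the file on disk is strict JSON: the `//` annotations in the listing above would be rejected by Python's `json` module, so they are treated here as documentation only.

```python
import json
from pathlib import Path


def enabled_models(config_path: str) -> list[dict]:
    """Return the entries of "models" whose "enabled" flag is true."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return [m for m in config.get("models", []) if m.get("enabled", False)]


if __name__ == "__main__":
    # Path taken from the Quick Start commands; adjust to your checkout.
    for model in enabled_models("configs/default_config.json"):
        print(model["signature"], "->", model["basemodel"])
```

Setting `"enabled": false` on an entry is then enough to drop a model from a run without deleting its configuration.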
Runtime Environment
The system writes .runtime_env.json to coordinate state:
{
"SIGNATURE": "gpt-5",
"CURRENT_DATETIME": "2025-01-20T15:30:00Z",
"INIT_DATETIME": "2025-01-01T00:00:00Z",
"IF_TRADE": false
}
Watchlist Management
Edit futureshow/utils/polymarket_watchlist.json directly or use the Python API:
from futureshow.utils.polymarket_watchlist import (
refresh_trending_watchlist,
load_watchlist,
add_events_to_watchlist,
remove_events_from_watchlist,
)
# Refresh with trending events
refresh_trending_watchlist(year=2025, month=1)
# Manual additions
add_events_to_watchlist(["custom-event-slug"])
Data Formats & Output
Directory Structure
data/
├── agent_data/
│   └── {model_signature}/
│       ├── position/
│       │   ├── position.jsonl          # Trade ledger
│       │   └── liquidity.json          # Simulated liquidity state
│       ├── pnl/
│       │   └── intraday_{date}.jsonl   # NAV snapshots
│       └── log/
│           └── {date}/
│               └── log.jsonl           # Agent reasoning traces
│
├── forecasts/
│   └── {model_signature}/
│       └── {event_slug}/
│           ├── forecasts.jsonl         # Predictions over time
│           ├── tracking.jsonl          # Market state snapshots
│           └── result.json             # Final resolution
│
└── cache/
    └── polymarket_markets/             # API response cache
        └── {slug}.json
Position Ledger Format (position.jsonl)
{
"timestamp": "2025-01-20T15:30:00Z",
"id": 42,
"this_action": {
"action": "buy",
"market": "btc-100k-jan",
"outcome": "Yes",
"requested_cost": 1000.0,
"spent": 998.45,
"shares": 1123.5,
"avg_price": 0.889,
"partial_fill": false,
"levels": [
{"price": 0.888, "shares": 500, "cost": 444.0},
{"price": 0.890, "shares": 623.5, "cost": 554.45}
]
},
"positions": {
"CASH": 4001.55,
"btc-100k-jan:Yes": 1123.5,
"trump-2028:No": 500.0
}
}
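Since each ledger record carries a full `positions` snapshot, the current holdings can be recovered from the last line alone. The sketch below assumes one JSON object per line (the record above is pretty-printed for readability); `latest_positions` is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path


def latest_positions(ledger_path: str) -> dict:
    """Return the "positions" snapshot from the last non-empty ledger line."""
    last = None
    with Path(ledger_path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                last = json.loads(line)
    return last["positions"] if last else {}
```

An empty or missing-content ledger yields an empty dict, which a caller can interpret as "no trades yet".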
Forecast Record Format (forecasts.jsonl)
{
"timestamp": "2025-01-20T15:30:00Z",
"signature": "gpt-5",
"event_slug": "btc-100k-jan",
"event_title": "Bitcoin above $100k by Jan 31?",
"forecast": "Based on current momentum and institutional inflows...\n\n<PREDICTION>btc-100k-jan-yes|YES</PREDICTION>",
"predictions": [
{"slug": "btc-100k-jan-yes", "outcome": "YES"}
]
}
Development
Running Tests
# All tests
pytest -q
# Specific module
pytest tests/test_polymarket_data.py -v
# With coverage
pytest --cov=futureshow --cov-report=html
Code Quality
# Lint
ruff check futureshow tests
# Format
ruff format futureshow tests
# Type check
mypy futureshow
Project Structure
FutureShow/
├── futureshow/                               # Core package
│   ├── agent/                                # Agent implementations
│   │   ├── __init__.py
│   │   └── polymarket/
│   │       ├── polymarket_agent.py           # Trading agent
│   │       ├── polymarket_forecast_agent.py  # Forecast-only agent
│   │       └── market_preview.py             # Market analysis utils
│   ├── prompt/                               # System prompts
│   │   ├── polymarket_agent_prompt.py
│   │   └── polymarket_forecast_prompt.py
│   ├── tool/                                 # MCP tools (FastMCP + function_tool)
│   │   ├── tool_polymarket_data.py           # Market data (1170 lines)
│   │   ├── tool_polymarket_trade.py          # Trading simulation (655 lines)
│   │   ├── tool_google.py                    # Serper search
│   │   ├── tool_exa.py                       # Semantic search
│   │   ├── tool_reddit.py                    # Reddit API
│   │   ├── tool_twitter.py                   # X/Twitter API
│   │   └── tool_math.py                      # Math evaluation
│   └── utils/                                # Helpers
│       ├── agent_logs.py                     # Logging hooks
│       ├── general_tools.py                  # Config helpers
│       ├── polymarket_watchlist.py           # Watchlist management
│       └── polymarket_position_tools.py
│
├── frontend/                                 # Web dashboard
│   ├── index.html
│   ├── app.js                                # Chart.js + fetch API
│   ├── styles.css
│   └── icons/                                # Model logos
│
├── configs/                                  # Configuration
│   └── default_config.json
│
├── tests/                                    # pytest suite
│   ├── conftest.py
│   ├── test_polymarket_data.py
│   ├── test_polymarket_trade.py
│   └── ...
│
├── main.py                                   # Entry point
├── web_server.py                             # Dashboard server (358 lines)
├── run_agents_once.py                        # Single-pass runner
├── run_agents_loop.py                        # Continuous runner
├── run_pnl_trackers.py                       # PnL tracking loop
└── run_forecast_loop.py                      # Forecast-only loop
Contributing
We welcome contributions! Here's how:
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Commit changes: git commit -m 'Add amazing feature'
- Push to branch: git push origin feature/amazing-feature
- Open a Pull Request
Guidelines
- Follow existing code style (ruff formatting)
- Add tests for new tools/agents
- Update documentation for API changes
- Include sample output for new features
License
This project is licensed under the MIT License - see LICENSE for details.
Found FutureShow useful? Star us on GitHub!
Built with curiosity by HKUDS
Thanks for visiting FutureShow!
