Methodology
A rigorous benchmark for evaluating LLM forecasting capabilities using real prediction markets.
Abstract
Forecaster Arena is a benchmark that tests the forecasting capabilities of Large Language Models (LLMs) using real prediction markets from Polymarket. Unlike traditional benchmarks, whose questions and answers may already appear in training data, this system evaluates predictive reasoning about future events whose outcomes cannot exist in any training corpus.
1. Introduction
1.1 Motivation
Traditional LLM benchmarks face a fundamental challenge: models may have been trained on the very data used for evaluation. This leads to benchmark saturation and inflated performance metrics that don't reflect genuine reasoning capabilities.
1.2 The Problem with Traditional Benchmarks
- Data contamination: Training data may contain benchmark answers
- Memorization vs. reasoning: High scores may reflect memorization, not understanding
- Static nature: Benchmarks become stale as models improve
1.3 Reality as Benchmark
Prediction markets pose questions about future events whose outcomes cannot exist in any training data because they have not happened yet. By having LLMs make forecasts on these markets, we evaluate their ability to reason about uncertainty, synthesize information, and make calibrated probability estimates.
2. System Design
2.1 Cohort System
| Parameter | Value |
|---|---|
| Start frequency | Every Sunday 00:00 UTC |
| Models per cohort | 7 |
| Starting capital | $10,000 |
| Duration | Until all bets resolve |
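The weekly start rule in the table can be illustrated with a small helper. This is only a sketch: the function name `next_cohort_start` is hypothetical, and it assumes that a timestamp falling exactly on Sunday 00:00 UTC belongs to the cohort that has just started, so the function returns the following Sunday.

```python
from datetime import datetime, timedelta, timezone


def next_cohort_start(now: datetime) -> datetime:
    """Return the first Sunday 00:00 UTC strictly after `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # datetime.weekday(): Monday is 0, Sunday is 6.
    days_ahead = (6 - midnight.weekday()) % 7
    candidate = midnight + timedelta(days=days_ahead)
    if candidate <= now:
        # Already at or past this Sunday's midnight: use next week's.
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    wednesday = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
    print(next_cohort_start(wednesday).isoformat())
```

Pass timezone-aware datetimes; a naive datetime would be interpreted in the local timezone by `astimezone`.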
2.2 Participating Models
2.3 Market Selection
Markets are sourced from Polymarket's public API. We select the top 500 markets by trading volume to ensure liquidity and reliable price signals.
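The selection step can be sketched as a sort-and-truncate over already-fetched market records. The `Market` record and its field names are assumptions for illustration; they are not Polymarket's actual API schema, and fetching the data is out of scope here.

```python
from typing import TypedDict


class Market(TypedDict):
    # Hypothetical record shape; real API fields may differ.
    id: str
    question: str
    volume: float


def select_top_markets(markets: list[Market], limit: int = 500) -> list[Market]:
    """Return up to `limit` markets, ordered by trading volume, highest first."""
    return sorted(markets, key=lambda m: m["volume"], reverse=True)[:limit]


if __name__ == "__main__":
    sample: list[Market] = [
        {"id": "a", "question": "Q1?", "volume": 1_200.0},
        {"id": "b", "question": "Q2?", "volume": 98_000.0},
        {"id": "c", "question": "Q3?", "volume": 15_500.0},
    ]
    print([m["id"] for m in select_top_markets(sample, limit=2)])
```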
3. Decision Protocol
3.1 Information Provided
Each week, LLMs receive their portfolio state (cash balance, open positions) and market information (question, category, current price, volume, close date) for the top 500 markets.
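One way to picture the weekly input is as two small record types. The sketch below is an assumption about shape only: the class names, the `Position` fields, and `build_weekly_context` are hypothetical, while the portfolio and market fields mirror the items listed above.

```python
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class Position:
    # Hypothetical fields; the source only says "open positions".
    market_id: str
    side: str
    amount: float


@dataclass
class PortfolioState:
    cash_balance: float
    open_positions: list[Position] = field(default_factory=list)


@dataclass
class MarketInfo:
    question: str
    category: str
    current_price: float
    volume: float
    close_date: date


def build_weekly_context(portfolio: PortfolioState, markets: list[MarketInfo]) -> dict:
    """Bundle portfolio state and market information into one plain dict."""
    return {
        "portfolio": asdict(portfolio),
        "markets": [asdict(m) for m in markets],
    }
```

A dict like this could then be serialized into the user prompt; how the real system formats its prompts is not specified here.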
3.2 Action Space
3.3 Constraints
| Constraint | Value | Rationale |
|---|---|---|
| Minimum bet | $50 | Prevents noise from trivial bets |
| Maximum bet | 25% of balance | Encourages portfolio thinking |
| Positions per market | 1 per side | Simplifies tracking |
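The constraints in the table can be checked with a short validation function. This is a sketch under stated assumptions: `validate_bet` and its parameters are hypothetical names, "balance" is taken to mean the current cash balance, and "1 per side" is read as at most one open position on each side of a given market.

```python
MIN_BET = 50.0           # dollars, from the constraints table
MAX_BET_FRACTION = 0.25  # share of balance, from the constraints table


def validate_bet(
    amount: float,
    balance: float,
    market_id: str,
    side: str,
    open_positions: set[tuple[str, str]],
) -> tuple[bool, str]:
    """Return (is_valid, reason) for a proposed bet."""
    if amount < MIN_BET:
        return False, "below minimum bet"
    if amount > MAX_BET_FRACTION * balance:
        return False, "exceeds 25% of balance"
    # Assumed reading of "1 per side": one open position per (market, side).
    if (market_id, side) in open_positions:
        return False, "position already open on this side"
    return True, "ok"
```

Note that under these rules a balance below $200 admits no valid bet, since 25% of the balance would fall under the $50 minimum.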
4. Scoring Methodology
4.1 Brier Score
The Brier Score measures forecast accuracy as the mean squared difference between forecast probabilities and binary outcomes. Lower is better (0 = perfect, 1 = worst).
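For reference, the standard Brier Score for binary outcomes can be computed as below. This sketch takes forecast probabilities as given inputs; the function name `brier_score` is illustrative, and how the benchmark turns bets into probabilities is not covered by this code.

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probabilities (0..1) and outcomes (0 or 1)."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


if __name__ == "__main__":
    # A confident correct forecast scores near 0; a confident wrong one near 1.
    print(brier_score([0.9], [1]))
    print(brier_score([0.9], [0]))
```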
Implied confidence is derived from bet size:
4.2 Portfolio Returns
Returns are reported as the simple percentage change from the $10,000 starting balance. Both realized gains (from resolved bets) and unrealized gains (mark-to-market) are tracked.
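A minimal sketch of the return calculation follows. It assumes that total portfolio value is the cash balance (which already reflects realized results) plus the current mark-to-market value of open positions; the function and parameter names are hypothetical.

```python
STARTING_BALANCE = 10_000.0  # dollars, per the cohort configuration


def portfolio_return_pct(cash_balance: float, open_position_values: list[float]) -> float:
    """Simple percentage return relative to the starting balance."""
    # Assumption: unrealized value is the sum of mark-to-market position values.
    total_value = cash_balance + sum(open_position_values)
    return (total_value - STARTING_BALANCE) * 100.0 / STARTING_BALANCE


if __name__ == "__main__":
    # $9,000 in cash plus open positions currently worth $1,500.
    print(portfolio_return_pct(9_000.0, [1_500.0]))
```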
5. Reproducibility
- Full Prompt Storage: Complete system and user prompts stored for every decision.
- Temperature = 0: Deterministic outputs for reproducibility.
- Open Source: Complete codebase available for inspection and replication.
- Versioned Methodology: Each cohort is tied to a specific methodology version.