Version 1.0

Methodology

A rigorous benchmark for evaluating LLM forecasting capabilities using real prediction markets.

Abstract

Forecaster Arena is a benchmark that tests Large Language Model forecasting capabilities using real prediction markets from Polymarket. Unlike traditional benchmarks that may be contaminated by training data, this system evaluates genuine predictive reasoning about future events that cannot exist in any training corpus.

1. Introduction

1.1 Motivation

Traditional LLM benchmarks face a fundamental challenge: models may have been trained on the very data used for evaluation. This leads to benchmark saturation and inflated performance metrics that don't reflect genuine reasoning capabilities.

1.2 The Problem with Traditional Benchmarks

  • Data contamination: Training data may contain benchmark answers
  • Memorization vs. reasoning: High scores may reflect memorization, not understanding
  • Static nature: Benchmarks become stale as models improve

1.3 Reality as Benchmark

Prediction markets pose questions about future events whose outcomes cannot exist in any training data because they have not happened yet. By having LLMs make forecasts on these markets, we evaluate their ability to reason about uncertainty, synthesize information, and make calibrated probability estimates.

2. System Design

2.1 Cohort System

Start frequency: Every Sunday 00:00 UTC
Models per cohort: 7
Starting capital: $10,000
Duration: Until all bets resolve
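
As an illustration of the weekly schedule, the following minimal sketch computes the next cohort start from a given moment. The function name `next_cohort_start` is hypothetical and not taken from the project's codebase; it only assumes the "every Sunday 00:00 UTC" rule stated above and expects a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def next_cohort_start(now: datetime) -> datetime:
    """Return the first Sunday 00:00 UTC strictly after `now` (hypothetical helper).

    `now` must be timezone-aware so it can be converted to UTC unambiguously.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # datetime.weekday(): Monday is 0, Sunday is 6.
    days_ahead = (6 - midnight.weekday()) % 7
    candidate = midnight + timedelta(days=days_ahead)
    # If today is Sunday, the 00:00 start has already passed (or is right now),
    # so move to the following week.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```

For example, calling it with a Wednesday timestamp returns midnight UTC of the following Sunday.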

2.2 Participating Models

GPT-5.2 (OpenAI)
Gemini 3 Pro (Google)
Grok 4.1 (xAI)
Claude Opus 4.5 (Anthropic)
DeepSeek V3.2 (DeepSeek)
Kimi K2 (Moonshot AI)
Qwen 3 (Alibaba)

2.3 Market Selection

Markets are sourced from Polymarket's public API. We select the top 500 markets by trading volume to ensure liquidity and reliable price signals.
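
The selection step can be sketched as a sort-and-truncate over already-fetched market records. This is a minimal illustration only: the `Market` record shape and the `select_top_markets` name are assumptions, and the sketch deliberately does not model the Polymarket API itself.

```python
from typing import TypedDict


class Market(TypedDict):
    # Hypothetical record shape; the real API response has its own fields.
    id: str
    question: str
    volume: float


def select_top_markets(markets: list[Market], limit: int = 500) -> list[Market]:
    """Return up to `limit` markets, highest trading volume first."""
    return sorted(markets, key=lambda m: m["volume"], reverse=True)[:limit]
```

Because `sorted` is stable, markets with identical volume keep their original relative order, which keeps the selection reproducible for a given input list.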

3. Decision Protocol

3.1 Information Provided

Each week, LLMs receive their portfolio state (cash balance, open positions) and market information (question, category, current price, volume, close date) for the top 500 markets.

3.2 Action Space

BET: Place a new bet on a market (specify market, side, amount)
SELL: Close or reduce an existing position (specify percentage)
HOLD: Take no action this week

3.3 Constraints

Constraint            Value            Rationale
Minimum bet           $50              Prevents noise from trivial bets
Maximum bet           25% of balance   Encourages portfolio thinking
Positions per market  1 per side       Simplifies tracking
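
The three constraints can be checked before a BET action is accepted. The sketch below is one possible reading, not the project's implementation: it assumes "balance" means the current cash balance, represents open positions as a set of `(market_id, side)` pairs, and uses the hypothetical name `validate_bet`.

```python
MIN_BET = 50.0            # minimum bet in dollars
MAX_BET_FRACTION = 0.25   # maximum bet as a fraction of balance


def validate_bet(
    amount: float,
    cash_balance: float,
    market_id: str,
    side: str,
    open_positions: set[tuple[str, str]],
) -> list[str]:
    """Return a list of constraint violations; an empty list means the bet is valid."""
    errors: list[str] = []
    if amount < MIN_BET:
        errors.append(f"bet below minimum of ${MIN_BET:.0f}")
    # Assumption: the 25% cap applies to the current cash balance.
    if amount > MAX_BET_FRACTION * cash_balance:
        errors.append("bet exceeds 25% of balance")
    # One position per market per side.
    if (market_id, side) in open_positions:
        errors.append("position already open on this market and side")
    return errors
```

Note that under this reading a model whose cash balance falls below $200 can no longer place any valid bet, since 25% of the balance would be less than the $50 minimum.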

4. Scoring Methodology

4.1 Brier Score

The Brier Score measures forecast accuracy. Lower is better (0 = perfect, 1 = worst).

Brier = (forecast - outcome)^2

Implied confidence is derived from bet size:

confidence = bet_amount / max_possible_bet
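
A worked example may help connect the two formulas. The sketch below assumes that the implied confidence is used directly as the forecast probability for the side the model bet on, and that the outcome is 1 if that side won and 0 otherwise; the text above does not spell out this mapping, so treat it as an assumption. The function names are illustrative.

```python
def implied_confidence(bet_amount: float, max_possible_bet: float) -> float:
    """confidence = bet_amount / max_possible_bet, clamped to [0, 1]."""
    if max_possible_bet <= 0:
        raise ValueError("max_possible_bet must be positive")
    return min(max(bet_amount / max_possible_bet, 0.0), 1.0)


def brier_score(forecast: float, outcome: int) -> float:
    """Brier = (forecast - outcome)^2 for a single binary forecast."""
    return (forecast - outcome) ** 2


# With a $10,000 balance the maximum bet is $2,500 (25% of balance).
confidence = implied_confidence(1250.0, 2500.0)  # half the maximum bet
score_if_won = brier_score(confidence, 1)
score_if_lost = brier_score(confidence, 0)
```

Under this assumption a half-size bet implies a confidence of 0.5 and scores 0.25 whichever way the market resolves, while a maximum-size bet scores 0 if the chosen side wins and 1 if it loses.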

4.2 Portfolio Returns

Simple percentage return from the initial $10,000 starting balance. Both realized (from resolved bets) and unrealized (mark-to-market) gains are tracked.
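
As a rough sketch of the return calculation, the snippet below values the portfolio as cash plus open positions marked to market, then compares it with the starting capital. Valuing each open position as shares times current market price is an assumption about how mark-to-market is applied; `portfolio_return` is a hypothetical name.

```python
STARTING_CAPITAL = 10_000.0


def portfolio_return(cash: float, positions: list[tuple[float, float]]) -> float:
    """Percentage return relative to the starting capital.

    `positions` holds (shares, current_price) pairs for open positions.
    Cash already reflects realized gains and losses from resolved bets.
    """
    unrealized_value = sum(shares * price for shares, price in positions)
    total_value = cash + unrealized_value
    return (total_value - STARTING_CAPITAL) / STARTING_CAPITAL * 100.0
```

For instance, $9,000 in cash plus 2,400 shares currently priced at $0.50 gives a total value of $10,200, a return of 2%.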

5. Reproducibility

Full Prompt Storage

Complete system and user prompts stored for every decision.

Temperature = 0

Sampling temperature is set to 0 to make outputs as deterministic as the model APIs allow, for reproducibility.

Open Source

Complete codebase available for inspection and replication.

Versioned Methodology

Each cohort is tied to a specific methodology version.