Version 1.0

Methodology

A rigorous benchmark for evaluating LLM forecasting capabilities using real prediction markets.

Abstract

Forecaster Arena is a benchmark that tests Large Language Model forecasting capabilities using real prediction markets from Polymarket. Unlike traditional benchmarks that may be contaminated by training data, this system evaluates genuine predictive reasoning about future events that cannot exist in any training corpus.

1. Introduction

1.1 Motivation

Traditional LLM benchmarks face a fundamental challenge: models may have been trained on the very data used for evaluation. This leads to benchmark saturation and inflated performance metrics that don't reflect genuine reasoning capabilities.

1.2 The Problem with Traditional Benchmarks

Data contamination: training data may contain benchmark answers.
Memorization vs. reasoning: high scores may reflect memorization, not understanding.
Static nature: benchmarks become stale as models improve.

1.3 Reality as Benchmark

Prediction markets pose questions about future events, whose outcomes cannot exist in any training data because they have not happened yet. By having LLMs make forecasts on these markets, we evaluate their ability to reason about uncertainty, synthesize information, and make calibrated probability estimates.

2. System Design

2.1 Cohort System

Start frequency	Every Sunday 00:00 UTC
Models per cohort	7
Starting capital	$10,000
Duration	Until all bets resolve
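
The weekly start time above can be computed mechanically. The following is a minimal sketch, not the project's actual scheduler: the function name `next_cohort_start` is hypothetical, and it assumes that a cohort starting exactly at the current instant does not count as "next".

```python
from datetime import datetime, timedelta, timezone


def next_cohort_start(now: datetime) -> datetime:
    """Return the first Sunday 00:00 UTC strictly after `now`.

    `now` is expected to be timezone-aware; it is converted to UTC first.
    """
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # datetime.weekday(): Monday is 0, Sunday is 6.
    days_ahead = (6 - midnight.weekday()) % 7
    candidate = midnight + timedelta(days=days_ahead)
    # Assumption: if it is already Sunday, the next cohort is a week away.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```

For example, calling it with Wednesday 2025-01-01 12:00 UTC returns Sunday 2025-01-05 00:00 UTC.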

2.2 Participating Models

GPT-5.2 (OpenAI)

Gemini 3 Pro (Google)

Grok 4.1 (xAI)

Claude Opus 4.5 (Anthropic)

DeepSeek V3.2 (DeepSeek)

Kimi K2 (Moonshot AI)

Qwen 3 (Alibaba)

2.3 Market Selection

Markets are sourced from Polymarket's public API. We select the top 500 markets by trading volume to ensure liquidity and reliable price signals.
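
The selection step amounts to a sort and a cutoff. The sketch below illustrates it on plain dictionaries; it does not call the Polymarket API, and the `volume` field name and the `select_top_markets` helper are assumptions made for illustration only.

```python
from typing import Iterable


def select_top_markets(markets: Iterable[dict], limit: int = 500) -> list[dict]:
    """Return up to `limit` markets, highest trading volume first.

    Each market is assumed to be a dict with a numeric "volume" key
    (a hypothetical field name, not taken from the Polymarket schema).
    """
    return sorted(markets, key=lambda m: m["volume"], reverse=True)[:limit]
```

With `limit=2` and three markets of volume 10, 300, and 25, the result contains the markets with volume 300 and 25, in that order.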

3. Decision Protocol

3.1 Information Provided

Each week, LLMs receive their portfolio state (cash balance, open positions) and market information (question, category, current price, volume, close date) for the top 500 markets.

3.2 Action Space

BET: Place a new bet on a market (specify market, side, amount)

SELL: Close or reduce an existing position (specify percentage)

HOLD: Take no action this week
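
The three actions can be modeled as small typed records. This is an illustrative sketch under stated assumptions: the class names, the `market_id` field, the `"YES"`/`"NO"` side labels, and the dictionary keys read by `parse_action` are all hypothetical and are not the benchmark's actual response schema.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class Bet:
    market_id: str                # hypothetical market identifier
    side: Literal["YES", "NO"]    # assumed side labels
    amount: float                 # dollars to stake


@dataclass(frozen=True)
class Sell:
    market_id: str
    percentage: float             # share of the position to close (0-100)


@dataclass(frozen=True)
class Hold:
    pass


Action = Union[Bet, Sell, Hold]


def parse_action(raw: dict) -> Action:
    """Convert a decoded model response into a typed action."""
    kind = raw.get("action")
    if kind == "BET":
        return Bet(raw["market_id"], raw["side"], float(raw["amount"]))
    if kind == "SELL":
        return Sell(raw["market_id"], float(raw["percentage"]))
    if kind == "HOLD":
        return Hold()
    raise ValueError(f"unknown action: {kind!r}")
```

Rejecting unknown action types with an exception, rather than silently treating them as HOLD, keeps malformed model output visible in the logs.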

3.3 Constraints

Constraint	Value	Rationale
Minimum bet	$50	Prevents noise from trivial bets
Maximum bet	25% of balance	Encourages portfolio thinking
Positions per market	1 per side	Simplifies tracking
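
The constraints in the table can be checked before a bet is accepted. The following sketch assumes that "balance" means the current cash balance and that open positions are tracked as `(market_id, side)` pairs; `validate_bet` and its parameters are hypothetical names.

```python
MIN_BET = 50.0            # minimum bet in dollars
MAX_BET_FRACTION = 0.25   # maximum bet as a fraction of balance


def validate_bet(
    amount: float,
    cash_balance: float,
    market_id: str,
    side: str,
    open_positions: set[tuple[str, str]],
) -> list[str]:
    """Return a list of constraint violations; an empty list means valid."""
    errors = []
    if amount < MIN_BET:
        errors.append("bet is below the $50 minimum")
    # Assumption: the 25% cap applies to the current cash balance.
    if amount > MAX_BET_FRACTION * cash_balance:
        errors.append("bet exceeds 25% of balance")
    # One position per market per side.
    if (market_id, side) in open_positions:
        errors.append("a position already exists on this side of the market")
    return errors
```

Returning every violation at once, instead of stopping at the first, makes it easier to see why a particular decision was rejected.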

4. Scoring Methodology

4.1 Brier Score

The Brier Score measures forecast accuracy as the squared difference between a probability forecast and the binary outcome (0 or 1). Lower is better (0 = perfect, 1 = worst).

Brier = (forecast - outcome)^2

Implied confidence is derived from bet size:

confidence = bet_amount / max_possible_bet
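
The two formulas above can be combined as follows. This is a sketch under assumptions the text does not spell out: the implied confidence is treated as the forecast probability that the chosen side wins, the outcome is 1 if that side won and 0 otherwise, and the confidence is clamped to the range [0, 1]. The function names are hypothetical.

```python
def implied_confidence(bet_amount: float, max_possible_bet: float) -> float:
    """confidence = bet_amount / max_possible_bet, clamped to [0, 1]."""
    if max_possible_bet <= 0:
        raise ValueError("max_possible_bet must be positive")
    # Assumption: clamp so the value can be read as a probability.
    return min(max(bet_amount / max_possible_bet, 0.0), 1.0)


def brier_score(forecast: float, outcome: int) -> float:
    """Brier = (forecast - outcome)^2 for a single binary forecast."""
    return (forecast - outcome) ** 2


# Example: a $1,250 bet when the maximum allowed bet is $2,500.
confidence = implied_confidence(1250.0, 2500.0)  # 0.5
score_if_won = brier_score(confidence, 1)        # 0.25
```

Under this reading, a maximum-size bet that loses receives the worst possible score of 1, and a maximum-size bet that wins receives 0.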

4.2 Portfolio Returns

Portfolio return is the simple percentage change from the $10,000 starting balance. Both realized gains (from resolved bets) and unrealized gains (mark-to-market on open positions) are tracked.
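
As an illustration of the return calculation, the sketch below values open positions at the current market price and adds them to cash. The representation of a position as a `(shares, current_price)` pair and the name `portfolio_return_pct` are assumptions, not the project's data model.

```python
STARTING_BALANCE = 10_000.0


def portfolio_return_pct(cash: float, positions: list[tuple[float, float]]) -> float:
    """Percentage return relative to the $10,000 starting balance.

    `positions` holds (shares, current_price) pairs for open positions.
    Cash already reflects realized gains and losses; open positions are
    valued mark-to-market at the current price.
    """
    market_value = sum(shares * price for shares, price in positions)
    total_value = cash + market_value
    return (total_value - STARTING_BALANCE) / STARTING_BALANCE * 100.0
```

For instance, $9,000 in cash plus 3,000 shares priced at $0.50 gives a total value of $10,500, a return of about 5%.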

5. Reproducibility

Full Prompt Storage

Complete system and user prompts stored for every decision.

Temperature = 0

Sampling randomness is minimized so that outputs are as repeatable as each model provider allows.

Open Source

Complete codebase available for inspection and replication.

Versioned Methodology

Each cohort is tied to a specific methodology version.