The Blind Oracle: How PredictionArena.ai Starves Its AI Forecasters

Abstract

PredictionArena.ai claims to benchmark AI forecasting, but our reverse-engineering reveals critical architectural limitations. Unlike Metaculus tournament winners who employ multi-agent ensembles with 4-5 data sources and explicit inside/outside view separation, PredictionArena appears limited to single-source Kalshi data with monolithic reasoning. Professional forecasters beat AI bots with p=0.00001 when architectures lack these foundations. We identify 6 specific gaps and provide implementation-ready recommendations.

Key Findings

Architecture Gap: Metaculus SOTA uses 5 models + 4-5 data sources; PredictionArena appears to use single models with Kalshi data only
Performance Gap: Professional forecasters beat ALL bots in Metaculus Q2 2025 (head-to-head: -20.03, p = 0.00001)[1]
Text-Only Handicap: LLMs perform 41.7% worse without visual market data (EMNLP 2025)[3]

In the race to build AI systems that can predict the future, one platform has caught attention: PredictionArena.ai, created by Arcada Labs. The promise is tantalizing—autonomous AI agents competing on real prediction markets, proving whether machines can outforecast humans.

But beneath the Next.js interface and real-time leaderboards lies a troubling reality. Through extensive reverse-engineering of their JavaScript bundles, API endpoints, and database schema, we've uncovered a fundamental architectural flaw: PredictionArena.ai starves its AI agents of the very data they need to make intelligent predictions.

Methodology & Transparency

Verified

• Supabase PostgreSQL backend
• Agent-runner framework (GitHub)
• Kalshi API integration
• cycles table with reasoning TEXT

Inferred

• No visible RAG in agent-runner
• No web research tools in bundles
• Text-only Kalshi data input
• No visible base rate analysis

Unknown

• /api/research implementation
• Server-side research pipelines
• Actual prompt engineering
• Calibration mechanisms

Architecture Analysis

Backend Infrastructure

extracted_config.jsjavascript

1// Extracted from layout-*.js bundle
2const SUPABASE_URL = &quot;https://artjdzrirbbezfjpfzyd.supabase.co&quot;
3const SUPABASE_ANON_KEY = &quot;eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...&quot;
4 
5// Real-time subscription configuration
6const realtimeConfig = {
7  eventsPerSecond: 10  // Rate limited
8}

Database Schema (Inferred)

schema.sqlsql

1-- cycles (prediction reasoning)
2CREATE TABLE cycles (
3  id UUID PRIMARY KEY,
4  created_at TIMESTAMP,
5  agent_id UUID REFERENCES prediction_arena_agents(id),
6  reasoning TEXT,          -- FREE TEXT BLOB
7  thinking_duration_ms INT -- Just a timer
8);

The Core Problem: Data Starvation

PredictionArena.ai's scaffolding provides LLMs with minimal context for autonomous research. Unlike sophisticated forecasting systems, PredictionArena appears to feed its agents only Kalshi market data—no web search, no news analysis, no historical base rates, no expert opinion aggregation.

LLMs are essentially “blind forecasters”—they cannot access real-time news, research historical precedents, or retrieve domain-specific context.

Figure 1|Architecture Comparison: Metaculus SOTA vs PredictionArena

Metaculus SOTA (Panshul42)

PredictionArena

Panshul42's Q2 2025 winning architecture employs 6-7 research stages with multi-source data fusion, contrasted with PredictionArena's apparent single-source, monolithic approach.Source: github.com/Panshul42/Forecasting_Bot_Q2

SOTA Comparison: Metaculus Q2 2025 Winners

The Metaculus Q2 2025 AI Forecasting Benchmark saw 96 bots compete on 300+ questions for $30,000 in prizes.[1] The winning architectures demonstrate what “state-of-the-art” actually means.

Winner: Panshul42 (Score: 5,899 | Prize: $7,550)

Multi-Stage Retrieval-Augmented Pipeline

Question parsing and context extraction
Search query generation via LLMs
Asynchronous information retrieval from 4+ sources
Content synthesis and filtering
Parallel forecast generation across 5-agent committee
Ensemble aggregation with weighted averaging

5-Agent Forecasting Committee

2× Claude 3.7 Sonnet (inside view reasoning)
2× GPT-o4-mini (outside view reasoning)
1× GPT-o3 (double-weighted)

Direct Comparison

Metric	Metaculus SOTA	PredictionArena
Data Sources	4-5 (AskNews, Perplexity, Search APIs, Web Scraping)	~1 (Kalshi market data only)
Research Pipeline	6-7 structured stages	Monolithic reasoning blob
Model Ensemble	5 models with weighted averaging	Single model per agent
Inside/Outside View	Explicitly separated	Combined/unclear
Aggregation Method	Filtered median across models	None visible
Calibration Mechanism	Historical performance integration	None visible

Research Evidence

The following visualizations present data from peer-reviewed academic research, not subjective assessments.

Figure 2|Professional Forecasters vs AI Bots (Metaculus Q2 2025)

Q2 2025: -20.03 (95% CI: [-28.63, -11.41])

p = 0.00001 — All 10 Pros ranked above every bot

Head-to-head score showing bot disadvantage over four quarters. All 10 individual professional forecasters ranked above every bot.Source: EA Forum - Q2 AI Benchmark Results

Figure 3|LLM Calibration Error (Expected Calibration Error)

The Reasoning Paradox: GPT-5.2-XHigh with extended reasoning achieved ECE=0.395 (worst) and BSS=-0.799 (catastrophically negative).

Lower ECE indicates better calibration. Models with ECE > 0.2 show significant overconfidence issues that harm prediction accuracy.Source: KalshiBench Dec 2025

Figure 4|Text-Only vs Visual Market Data Performance

Performance Gap

41.7%

Visual representations outperform

text-only by nearly half

LLMs perform 41.7% worse when given text-only market data versus visual chart representations.Source: Agent Trading Arena (EMNLP 2025)

Figure 5|Memorization Contamination in LLM Forecasts

Stock Returns37% memorization

Earnings (CapEx)19% memorization

Average28% memorization

Kodak Case Study: When asked about July 28, 2020, the model recalled news from July 29 about the 318% surge—information recall from training data rather than reasoning.

MemorizationGenuine inference

19-37% of apparent LLM predictive power may be memorization of training data, not genuine inference.Source: arXiv Dec 2025

Architectural Gap Analysis

Based on our reverse-engineering, we identify six critical gaps between PredictionArena's observable architecture and SOTA systems:

Gap	SOTA Approach	PredictionArena
Multi-Source Data	4-5 concurrent APIs (AskNews, Perplexity, Google)	Kalshi market data only
Research Pipeline	6-7 step agentic workflow with structured outputs	Monolithic TEXT blob
Base Rate Analysis	BaseRateResearcher with historical probabilities	None visible
Model Ensemble	5 models with weighted averaging	Single model per agent
Inside/Outside View	Explicitly separated reasoning modes	Combined/unclear
Calibration	Overconfidence prompting, historical adjustment	None visible

Pragmatic Recommendations

Based on PredictionArena's existing architecture (Supabase + agent-runner + Kalshi API), here are implementation-ready improvements:

1. Structured Research Pipeline

recommended_schema.tstypescript

1// Replace monolithic TEXT with structured JSONB
2interface StructuredReasoning {
3  outside_view: {
4    reference_class: string;
5    base_rate: number;
6    historical_events: string[];
7  };
8  inside_view: {
9    causal_factors: string[];
10    mechanistic_analysis: string;
11  };
12  data_sources: string[];
13  confidence_calibration: number;
14  thinking_duration_ms: number;
15}
16 
17// ALTER TABLE cycles ADD COLUMN research_artifacts JSONB;

2. Multi-Source Data Integration

research_tools.pypython

1# Add to agent-runner/tools/
2research_sources = {
3    &quot;perplexity&quot;: &quot;llama-3.1-sonar-huge-128k-online&quot;,
4    &quot;asknews&quot;: &quot;max_news + historical_articles&quot;,
5    &quot;web_search&quot;: &quot;Google/Bing/DuckDuckGo&quot;,
6}
7 
8# EMNLP 2025: This alone could improve performance by 41.7%

3. Calibration Prompting

system_prompt.txttext

1WARNING: You have historically been overconfident.
2Base rate for Kalshi positive resolutions: ~45%
3Category-specific adjustments:
4  - Elections: Focus on polling aggregates
5  - Market prices: Consider momentum + fundamentals
6  - Disease outbreaks: Weight epidemiological models

Conclusion

PredictionArena.ai asks a compelling question: “Can models predict the future?” Our analysis suggests the platform may be handicapping its agents through architectural constraints—but we acknowledge significant uncertainty about server-side capabilities we cannot observe.

The broader research picture is sobering: even with sophisticated scaffolding, professional forecasters beat every AI bot in Metaculus Q2 2025 (p=0.00001).[1] Extended reasoning can actually worsencalibration. And 19-37% of apparent LLM “predictive power” may be memorization contamination, not genuine inference.

The challenge may not be just architecture—it may be fundamental limitations in how LLMs reason about uncertainty and the future.

We offer this analysis not as a takedown, but as a technical contribution. We invite the PredictionArena team to respond, and we commit to updating this analysis based on any corrections or clarifications they provide.

References

Metaculus Research Team (2025). Q2 AI Benchmark Results: Pros Maintain Clear Lead. Effective Altruism Forum. [Link]Accessed January 2026.
ForecastBench Team (2025). How Well Can Large Language Models Predict the Future?. Forecasting Research Substack. [Link]Accessed January 2026.
Chen et al. (2025). Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents. EMNLP 2025. [Link]Accessed January 2026.
Zhang et al. (2025). KalshiBench: Do LLMs Know What They Don't Know?. arXiv:2512.16030. [Link]Accessed January 2026.
Panshul42 (2025). Forecasting Bot Q2: Metaculus Tournament Winner. GitHub Repository. [Link]Accessed January 2026.
Metaculus (2025). Forecasting Tools Framework. GitHub Repository. [Link]Accessed January 2026.
Xu et al. (2025). LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena. arXiv:2510.17638. [Link]Accessed January 2026.