Technical Analysis

The Blind Oracle

How PredictionArena.ai's architecture starves its AI forecasters of the data they need to see the future

Performance gap: -20.03 head-to-head score · P-value: 0.00001 (statistically significant) · Data sources: 1 vs 5 (PredictionArena vs Metaculus SOTA)
Published: January 2026 · ~15 min read · 7 References
Abstract
PredictionArena.ai claims to benchmark AI forecasting, but our reverse-engineering reveals critical architectural limitations. Unlike Metaculus tournament winners, who employ multi-agent ensembles with 4-5 data sources and explicit inside/outside view separation, PredictionArena appears limited to single-source Kalshi data and monolithic reasoning. Professional forecasters beat every AI bot in the Metaculus Q2 2025 benchmark (p = 0.00001), and the gap only widens when architectures lack these foundations. We identify six specific gaps and provide implementation-ready recommendations.
Key Findings
  • Architecture Gap: Metaculus SOTA uses 5 models + 4-5 data sources; PredictionArena appears to use single models with Kalshi data only
  • Performance Gap: Professional forecasters beat ALL bots in Metaculus Q2 2025 (head-to-head: -20.03, p = 0.00001)[1]
  • Text-Only Handicap: LLMs perform 41.7% worse without visual market data (EMNLP 2025)[3]

In the race to build AI systems that can predict the future, one platform has been attracting attention: PredictionArena.ai, created by Arcada Labs. The promise is tantalizing: autonomous AI agents competing on real prediction markets, proving whether machines can out-forecast humans.

But beneath the Next.js interface and real-time leaderboards lies a troubling reality. Through extensive reverse-engineering of their JavaScript bundles, API endpoints, and database schema, we've uncovered a fundamental architectural flaw: PredictionArena.ai starves its AI agents of the very data they need to make intelligent predictions.

Methodology & Transparency

Verified

  • Supabase PostgreSQL backend
  • Agent-runner framework (GitHub)
  • Kalshi API integration
  • cycles table with reasoning TEXT

Inferred

  • No visible RAG in agent-runner
  • No web research tools in bundles
  • Text-only Kalshi data input
  • No visible base rate analysis

Unknown

  • /api/research implementation
  • Server-side research pipelines
  • Actual prompt engineering
  • Calibration mechanisms

Architecture Analysis

Backend Infrastructure

extracted_config.js (JavaScript)

// Extracted from layout-*.js bundle
const SUPABASE_URL = "https://artjdzrirbbezfjpfzyd.supabase.co"
const SUPABASE_ANON_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."

// Real-time subscription configuration
const realtimeConfig = {
  eventsPerSecond: 10 // Rate limited
}

Database Schema (Inferred)

schema.sql (SQL)

-- cycles (prediction reasoning)
CREATE TABLE cycles (
  id UUID PRIMARY KEY,
  created_at TIMESTAMP,
  agent_id UUID REFERENCES prediction_arena_agents(id),
  reasoning TEXT,            -- FREE TEXT BLOB
  thinking_duration_ms INT   -- Just a timer
);
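
If the extracted credentials and table names above are accurate, the reasoning blobs can be read back with the official supabase-py client. This is an illustrative sketch of ours, not code from the platform; the anon key placeholder must be replaced with the full extracted value.

read_cycles.py (Python, illustrative)

from supabase import create_client

# URL from the extracted config above; the anon key is truncated in this
# article, so a placeholder stands in for it here.
supabase = create_client(
    "https://artjdzrirbbezfjpfzyd.supabase.co",
    "<SUPABASE_ANON_KEY>",
)

# Pull the five most recent reasoning blobs to inspect their (lack of) structure.
rows = (
    supabase.table("cycles")
    .select("agent_id, reasoning, thinking_duration_ms")
    .order("created_at", desc=True)
    .limit(5)
    .execute()
)
for row in rows.data:
    print(row["thinking_duration_ms"], row["reasoning"][:80])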

The Core Problem: Data Starvation

PredictionArena.ai's scaffolding provides LLMs with minimal context for autonomous research. Unlike sophisticated forecasting systems, PredictionArena appears to feed its agents only Kalshi market data—no web search, no news analysis, no historical base rates, no expert opinion aggregation.

Under this scaffolding, the agents are essentially “blind forecasters”: they cannot access real-time news, research historical precedents, or retrieve domain-specific context.
Figure 1 | Architecture Comparison: Metaculus SOTA (Panshul42) vs PredictionArena. Panshul42's Q2 2025 winning architecture employs 6-7 research stages with multi-source data fusion, contrasted with PredictionArena's apparent single-source, monolithic approach. Source: github.com/Panshul42/Forecasting_Bot_Q2 [5]

SOTA Comparison: Metaculus Q2 2025 Winners

The Metaculus Q2 2025 AI Forecasting Benchmark saw 96 bots compete on 300+ questions for $30,000 in prizes.[1] The winning architectures demonstrate what “state-of-the-art” actually means.

Winner: Panshul42 (Score: 5,899 | Prize: $7,550)

Multi-Stage Retrieval-Augmented Pipeline

  1. Question parsing and context extraction
  2. Search query generation via LLMs
  3. Asynchronous information retrieval from 4+ sources
  4. Content synthesis and filtering
  5. Parallel forecast generation across 5-agent committee
  6. Ensemble aggregation with weighted averaging

5-Agent Forecasting Committee (aggregation sketched below)

  • Claude 3.7 Sonnet (inside view reasoning)
  • OpenAI o4-mini (outside view reasoning)
  • OpenAI o3 (double-weighted)
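
To make the aggregation stage concrete, here is a minimal, runnable Python sketch of a weighted committee. The weights and the filtered-median variant reflect our reading of the public write-up, not Panshul42's actual code.

committee_sketch.py (Python, illustrative)

import statistics

# (model, weight) pairs: o3 is double-weighted, per the committee description above.
COMMITTEE = [("claude-3.7-sonnet", 1.0), ("o4-mini", 1.0), ("o3", 2.0)]

def weighted_average(forecasts: dict[str, float]) -> float:
    """Stage 6: ensemble aggregation with weighted averaging."""
    total = sum(w for _, w in COMMITTEE)
    return sum(forecasts[m] * w for m, w in COMMITTEE) / total

def filtered_median(forecasts: dict[str, float], spread: float = 0.25) -> float:
    """Drop forecasts far from the median, then re-take the median --
    one way to implement the 'filtered median' in the table below."""
    med = statistics.median(forecasts.values())
    kept = [p for p in forecasts.values() if abs(p - med) <= spread]
    return statistics.median(kept)

forecasts = {"claude-3.7-sonnet": 0.62, "o4-mini": 0.55, "o3": 0.70}
print(weighted_average(forecasts))  # 0.6425
print(filtered_median(forecasts))   # 0.62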

Direct Comparison

Metric | Metaculus SOTA | PredictionArena
Data Sources | 4-5 (AskNews, Perplexity, Search APIs, Web Scraping) | ~1 (Kalshi market data only)
Research Pipeline | 6-7 structured stages | Monolithic reasoning blob
Model Ensemble | 5 models with weighted averaging | Single model per agent
Inside/Outside View | Explicitly separated | Combined/unclear
Aggregation Method | Filtered median across models | None visible
Calibration Mechanism | Historical performance integration | None visible

Research Evidence

The following figures summarize data from published benchmarks and academic research rather than our own subjective assessments.

Figure 2 | Professional Forecasters vs AI Bots (Metaculus Q2 2025). Head-to-head score: -20.03 (95% CI: [-28.63, -11.41]), p = 0.00001. All 10 individual professional forecasters ranked above every bot; the chart tracks the bots' disadvantage over four quarters. Source: EA Forum, Q2 AI Benchmark Results.[1]

Figure 3 | LLM Calibration Error (Expected Calibration Error). The Reasoning Paradox: GPT-5.2-XHigh with extended reasoning achieved ECE = 0.395 (worst) and BSS = -0.799 (catastrophically negative). Lower ECE indicates better calibration; models with ECE > 0.2 show significant overconfidence that harms prediction accuracy. Source: KalshiBench, Dec 2025.[4]
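
For readers who want to reproduce calibration numbers like these, expected calibration error is straightforward to compute. Below is a standard binned implementation; it is our sketch, not KalshiBench's code.

ece_sketch.py (Python, illustrative)

import numpy as np

def expected_calibration_error(probs, outcomes, n_bins: int = 10) -> float:
    """Bin forecasts by confidence, then average |accuracy - confidence|
    across bins, weighted by the fraction of forecasts in each bin."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # Last bin is closed on the right so probs == 1.0 are counted.
        in_bin = (probs >= lo) & ((probs < hi) | (i == n_bins - 1))
        if in_bin.any():
            ece += in_bin.mean() * abs(outcomes[in_bin].mean() - probs[in_bin].mean())
    return ece

# A model that always says 0.9 on questions that resolve YES half the time:
print(expected_calibration_error([0.9] * 10, [1, 0] * 5))  # ~0.4
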
Figure 4 | Text-Only vs Visual Market Data Performance. LLMs perform 41.7% worse when given text-only market data versus visual chart representations; visual input outperforms text-only by nearly half. Source: Agent Trading Arena (EMNLP 2025).[3]

Figure 5 | Memorization Contamination in LLM Forecasts

  • Stock returns: 37% memorization
  • Earnings (CapEx): 19% memorization
  • Average: 28% memorization

Kodak case study: when asked to forecast as of July 28, 2020, the model recalled news from July 29 about the 318% surge—information recall from training data rather than reasoning. Overall, 19-37% of apparent LLM predictive power may be memorization of training data, not genuine inference. Source: arXiv, Dec 2025.
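
A benchmark can partially guard against this by scoring a model only on questions that resolve after its training cutoff. A minimal sketch follows; the cutoff dates are illustrative, not actual values for any deployed agent.

cutoff_guard.py (Python, illustrative)

from datetime import date

# Illustrative cutoffs; real values must come from each model's documentation.
TRAINING_CUTOFFS = {"agent-alpha": date(2024, 10, 1)}

def contamination_safe(model: str, resolution_date: date) -> bool:
    """True only if the outcome could not have appeared in training data."""
    return resolution_date > TRAINING_CUTOFFS[model]

# The Kodak question above would be rejected for any post-2020 model:
print(contamination_safe("agent-alpha", date(2020, 7, 28)))  # False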

Architectural Gap Analysis

Based on our reverse-engineering, we identify six critical gaps between PredictionArena's observable architecture and SOTA systems:

Gap | SOTA Approach | PredictionArena
Multi-Source Data | 4-5 concurrent APIs (AskNews, Perplexity, Google) | Kalshi market data only
Research Pipeline | 6-7 step agentic workflow with structured outputs | Monolithic TEXT blob
Base Rate Analysis | BaseRateResearcher with historical probabilities | None visible
Model Ensemble | 5 models with weighted averaging | Single model per agent
Inside/Outside View | Explicitly separated reasoning modes | Combined/unclear
Calibration | Overconfidence prompting, historical adjustment | None visible

Pragmatic Recommendations

Based on PredictionArena's existing architecture (Supabase + agent-runner + Kalshi API), here are implementation-ready improvements:

1. Structured Research Pipeline

recommended_schema.ts (TypeScript)

// Replace monolithic TEXT with structured JSONB
interface StructuredReasoning {
  outside_view: {
    reference_class: string;
    base_rate: number;
    historical_events: string[];
  };
  inside_view: {
    causal_factors: string[];
    mechanistic_analysis: string;
  };
  data_sources: string[];
  confidence_calibration: number;
  thinking_duration_ms: number;
}

// ALTER TABLE cycles ADD COLUMN research_artifacts JSONB;
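
On the write path, the same structure can be enforced before a row reaches the database. Here is a Pydantic sketch of ours mirroring the interface above (assuming Pydantic v2); it rejects malformed agent output instead of storing it.

validate_reasoning.py (Python, illustrative)

from pydantic import BaseModel, Field

class OutsideView(BaseModel):
    reference_class: str
    base_rate: float = Field(ge=0.0, le=1.0)
    historical_events: list[str]

class InsideView(BaseModel):
    causal_factors: list[str]
    mechanistic_analysis: str

class StructuredReasoning(BaseModel):
    outside_view: OutsideView
    inside_view: InsideView
    data_sources: list[str]
    confidence_calibration: float = Field(ge=0.0, le=1.0)
    thinking_duration_ms: int

# Example payload an agent might emit (values are made up):
raw = """{
  "outside_view": {"reference_class": "US CPI releases",
                   "base_rate": 0.42,
                   "historical_events": ["2023-06 print"]},
  "inside_view": {"causal_factors": ["energy prices"],
                  "mechanistic_analysis": "..."},
  "data_sources": ["kalshi", "asknews"],
  "confidence_calibration": 0.55,
  "thinking_duration_ms": 12000
}"""
artifact = StructuredReasoning.model_validate_json(raw)  # raises on bad output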

2. Multi-Source Data Integration

research_tools.py (Python)

# Add to agent-runner/tools/
research_sources = {
    "perplexity": "llama-3.1-sonar-huge-128k-online",
    "asknews": "max_news + historical_articles",
    "web_search": "Google/Bing/DuckDuckGo",
}

# EMNLP 2025 found text-only agents perform 41.7% worse than those given
# richer market representations; multi-source research narrows that gap.
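
A deliberately simplified fan-out for such a tools module might look like the following. The fetchers are placeholders standing in for real provider SDK calls, not actual client code.

gather_research.py (Python, illustrative)

import asyncio

async def fetch_perplexity(query: str) -> list[str]:
    # Placeholder: a real tool would call Perplexity's online model here.
    return [f"[perplexity] {query}"]

async def fetch_asknews(query: str) -> list[str]:
    return [f"[asknews] {query}"]

async def fetch_web_search(query: str) -> list[str]:
    return [f"[web] {query}"]

async def gather_research(query: str) -> list[str]:
    # Fan out concurrently; a failing source should degrade the research
    # context, not sink the whole prediction cycle.
    results = await asyncio.gather(
        fetch_perplexity(query),
        fetch_asknews(query),
        fetch_web_search(query),
        return_exceptions=True,
    )
    return [s for r in results if not isinstance(r, Exception) for s in r]

print(asyncio.run(gather_research("Will CPI exceed 3% in June?")))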

3. Calibration Prompting

system_prompt.txt (text)

WARNING: You have historically been overconfident.
Base rate for Kalshi positive resolutions: ~45%
Category-specific adjustments:
  - Elections: Focus on polling aggregates
  - Market prices: Consider momentum + fundamentals
  - Disease outbreaks: Weight epidemiological models
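
Prompting alone may not be enough; Figure 3 suggests overconfidence survives it. A complementary post-hoc adjustment is to shrink each forecast toward the category base rate in log-odds space. The sketch below is ours, and the 0.7 weight is an illustrative tuning knob, not a recommended constant.

shrink_forecast.py (Python, illustrative)

import math

def _logit(x: float) -> float:
    return math.log(x / (1.0 - x))

def shrink_toward_base_rate(p: float, base_rate: float, weight: float = 0.7) -> float:
    """Blend the agent's probability with the base rate in log-odds space;
    weight < 1 damps the overconfidence documented in Figure 3."""
    z = weight * _logit(p) + (1.0 - weight) * _logit(base_rate)
    return 1.0 / (1.0 + math.exp(-z))

# An overconfident 0.95 gets pulled toward the ~45% Kalshi base rate:
print(round(shrink_toward_base_rate(0.95, 0.45), 3))  # 0.881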

Conclusion

PredictionArena.ai asks a compelling question: “Can models predict the future?” Our analysis suggests the platform may be handicapping its agents through architectural constraints—but we acknowledge significant uncertainty about server-side capabilities we cannot observe.

The broader research picture is sobering: even with sophisticated scaffolding, professional forecasters beat every AI bot in Metaculus Q2 2025 (p = 0.00001).[1] Extended reasoning can actually worsen calibration (Figure 3). And 19-37% of apparent LLM “predictive power” may be memorization contamination, not genuine inference.

The challenge may not be just architecture—it may be fundamental limitations in how LLMs reason about uncertainty and the future.

We offer this analysis not as a takedown, but as a technical contribution. We invite the PredictionArena team to respond, and we commit to updating this analysis based on any corrections or clarifications they provide.

References

  1. Metaculus Research Team (2025). Q2 AI Benchmark Results: Pros Maintain Clear Lead. Effective Altruism Forum. [Link] Accessed January 2026.
  2. ForecastBench Team (2025). How Well Can Large Language Models Predict the Future? Forecasting Research Substack. [Link] Accessed January 2026.
  3. Chen et al. (2025). Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents. EMNLP 2025. [Link] Accessed January 2026.
  4. Zhang et al. (2025). KalshiBench: Do LLMs Know What They Don't Know? arXiv:2512.16030. [Link] Accessed January 2026.
  5. Panshul42 (2025). Forecasting Bot Q2: Metaculus Tournament Winner. GitHub Repository. [Link] Accessed January 2026.
  6. Metaculus (2025). Forecasting Tools Framework. GitHub Repository. [Link] Accessed January 2026.
  7. Xu et al. (2025). LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena. arXiv:2510.17638. [Link] Accessed January 2026.