- Architecture Gap: Metaculus SOTA uses 5 models + 4-5 data sources; PredictionArena appears to use single models with Kalshi data only
- Performance Gap: Professional forecasters beat all bots in Metaculus Q2 2025 (head-to-head score difference: -20.03, p = 0.00001)[1]
- Text-Only Handicap: LLMs perform 41.7% worse without visual market data (EMNLP 2025)[3]
In the race to build AI systems that can predict the future, one platform has caught attention: PredictionArena.ai, created by Arcada Labs. The promise is tantalizing—autonomous AI agents competing on real prediction markets, proving whether machines can outforecast humans.
But beneath the Next.js interface and real-time leaderboards lies a troubling reality. Through extensive reverse-engineering of their JavaScript bundles, API endpoints, and database schema, we've uncovered a fundamental architectural flaw: PredictionArena.ai starves its AI agents of the very data they need to make intelligent predictions.
Methodology & Transparency
Verified
- Supabase PostgreSQL backend
- Agent-runner framework (GitHub)
- Kalshi API integration
- cycles table with reasoning TEXT
Inferred
- No visible RAG in agent-runner
- No web research tools in bundles
- Text-only Kalshi data input
- No visible base rate analysis
Unknown
- /api/research implementation
- Server-side research pipelines
- Actual prompt engineering
- Calibration mechanisms
Architecture Analysis
Backend Infrastructure
```javascript
// Extracted from layout-*.js bundle
const SUPABASE_URL = "https://artjdzrirbbezfjpfzyd.supabase.co"
const SUPABASE_ANON_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."

// Real-time subscription configuration
const realtimeConfig = {
  eventsPerSecond: 10 // Rate limited
}
```
Database Schema (Inferred)
```sql
-- cycles (prediction reasoning)
CREATE TABLE cycles (
  id UUID PRIMARY KEY,
  created_at TIMESTAMP,
  agent_id UUID REFERENCES prediction_arena_agents(id),
  reasoning TEXT, -- FREE TEXT BLOB
  thinking_duration_ms INT -- Just a timer
);
```
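This schema is inferred, not published. It can be checked from the outside because Supabase exposes tables over PostgREST with the public anon key. A minimal probe of the kind we used, assuming the table and column names we inferred:

```python
# Probe the public PostgREST endpoint using the anon key from the bundle.
# A 200 response with rows confirms the table exists and is readable;
# a 401/404 does not. Table and column names are our inference.
import requests

SUPABASE_URL = "https://artjdzrirbbezfjpfzyd.supabase.co"
ANON_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."  # truncated; see bundle above

resp = requests.get(
    f"{SUPABASE_URL}/rest/v1/cycles",
    params={"select": "id,reasoning,thinking_duration_ms", "limit": "1"},
    headers={"apikey": ANON_KEY, "Authorization": f"Bearer {ANON_KEY}"},
)
print(resp.status_code, resp.text[:200])
```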
The Core Problem: Data Starvation
PredictionArena.ai's scaffolding provides its LLMs with minimal context for autonomous research. Unlike sophisticated forecasting systems, PredictionArena appears to feed its agents only Kalshi market data: no web search, no news analysis, no historical base rates, no expert opinion aggregation.
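If the agent-runner behaves the way its visible code suggests, each prediction cycle reduces to roughly the following shape. This is a hypothetical reconstruction under our stated assumptions; the parameter and helper names are ours, not from the repository:

```python
# Hypothetical reconstruction of the inferred cycle: the only context the
# model ever sees is the Kalshi market snapshot itself.
def run_cycle(llm, kalshi, store, agent_id: str, ticker: str) -> None:
    snapshot = kalshi.get_market(ticker)      # price, volume, close date only
    prompt = (
        f"Market: {snapshot['title']}\n"
        f"Yes price: {snapshot['yes_price']}\n"
        "Forecast the probability that this market resolves YES."
    )
    reasoning = llm.complete(prompt)          # one monolithic text blob
    store(agent_id, reasoning=reasoning)      # lands in cycles.reasoning TEXT
    # Note what is absent: web search, news retrieval, base rates, ensembling.
```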
SOTA Comparison: Metaculus Q2 2025 Winners
The Metaculus Q2 2025 AI Forecasting Benchmark saw 96 bots compete on 300+ questions for $30,000 in prizes.[1] The winning architectures demonstrate what “state-of-the-art” actually means.
Winner: Panshul42 (Score: 5,899 | Prize: $7,550)
Multi-Stage Retrieval-Augmented Pipeline
- Question parsing and context extraction
- Search query generation via LLMs
- Asynchronous information retrieval from 4+ sources
- Content synthesis and filtering
- Parallel forecast generation across 5-agent committee
- Ensemble aggregation with weighted averaging
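In code, that pipeline has roughly the following shape. This is our paraphrase of the published architecture, not Panshul42's actual implementation; the helper functions and the SOURCES/COMMITTEE constants are illustrative assumptions:

```python
# Illustrative skeleton of the six stages above (helpers assumed to exist).
import asyncio

async def forecast(question: str) -> float:
    context = parse_question(question)                    # 1. parsing + context
    queries = generate_search_queries(context)            # 2. LLM query generation
    docs = await asyncio.gather(                          # 3. async retrieval, 4+ sources
        *(fetch(src, q) for src in SOURCES for q in queries)
    )
    evidence = synthesize_and_filter(docs)                # 4. synthesis and filtering
    forecasts = [m.predict(context, evidence) for m in COMMITTEE]  # 5. committee
    return weighted_average(forecasts)                    # 6. ensemble aggregation
```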
5-Agent Forecasting Committee
- 2× Claude 3.7 Sonnet (inside view reasoning)
- 2× OpenAI o4-mini (outside view reasoning)
- 1× OpenAI o3 (double-weighted)
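The double weight on o3 makes the committee an effective six-vote weighted mean. A minimal sketch of that aggregation, with weights per the published description and example probabilities invented for illustration:

```python
# Weighted mean over the five committee members, with o3 counted twice.
WEIGHTS = {
    "claude-3.7-sonnet-a": 1.0,
    "claude-3.7-sonnet-b": 1.0,
    "o4-mini-a": 1.0,
    "o4-mini-b": 1.0,
    "o3": 2.0,  # double-weighted
}

def aggregate(probs: dict[str, float]) -> float:
    total = sum(WEIGHTS.values())
    return sum(probs[name] * w for name, w in WEIGHTS.items()) / total

example = {"claude-3.7-sonnet-a": 0.62, "claude-3.7-sonnet-b": 0.58,
           "o4-mini-a": 0.70, "o4-mini-b": 0.66, "o3": 0.60}
print(aggregate(example))  # 0.6267, pulled toward the double-weighted o3
```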
Direct Comparison
| Metric | Metaculus SOTA | PredictionArena |
|---|---|---|
| Data Sources | 4-5 (AskNews, Perplexity, Search APIs, Web Scraping) | ~1 (Kalshi market data only) |
| Research Pipeline | 6-7 structured stages | Monolithic reasoning blob |
| Model Ensemble | 5 models with weighted averaging | Single model per agent |
| Inside/Outside View | Explicitly separated | Combined/unclear |
| Aggregation Method | Filtered median across models | None visible |
| Calibration Mechanism | Historical performance integration | None visible |
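The "filtered median" in the table refers to discarding outlier forecasts before taking the median. The exact filtering rule the winning bots use is not visible to us; this generic version, with a deviation threshold we chose for illustration, shows the technique:

```python
# Generic filtered median: drop forecasts far from the group, then take the
# median of what remains. The 0.25 threshold is an illustrative choice.
from statistics import median

def filtered_median(probs: list[float], max_dev: float = 0.25) -> float:
    m = median(probs)
    kept = [p for p in probs if abs(p - m) <= max_dev] or probs
    return median(kept)

print(filtered_median([0.60, 0.62, 0.65, 0.05]))  # outlier 0.05 dropped -> 0.62
```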
Research Evidence
The following results are drawn from peer-reviewed and published academic research, not from our subjective assessment.
Metaculus Q2 2025 head-to-head score difference, Pros vs. bots: -20.03 (95% CI: [-28.63, -11.41]), p = 0.00001. All 10 Pros ranked above every bot.[1]
The Reasoning Paradox: GPT-5.2-XHigh with extended reasoning posted the worst expected calibration error in the study (ECE = 0.395) and a catastrophically negative Brier skill score (BSS = -0.799), i.e., far worse than a no-skill baseline.
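For readers unfamiliar with the two metrics: ECE measures how far stated confidence drifts from realized accuracy, and a Brier skill score below zero means the forecasts underperform always guessing the base rate. Minimal implementations, with a binning choice (10 equal-width bins) that is ours:

```python
# Minimal versions of the two metrics cited above. ECE bins forecasts by
# confidence and averages |accuracy - confidence|; BSS compares the Brier
# score against a climatological baseline (negative = worse than baseline).
import numpy as np

def ece(probs, outcomes, n_bins=10):
    probs, outcomes = np.asarray(probs), np.asarray(outcomes)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(outcomes[mask].mean() - probs[mask].mean())
    return total

def brier_skill_score(probs, outcomes):
    probs, outcomes = np.asarray(probs), np.asarray(outcomes)
    bs = np.mean((probs - outcomes) ** 2)
    bs_ref = np.mean((outcomes.mean() - outcomes) ** 2)  # climatology baseline
    return 1 - bs / bs_ref
```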
Performance Gap: visual representations of market data outperform text-only inputs by 41.7%, nearly half again as good (EMNLP 2025).[3]
Kodak Case Study: asked to forecast as of July 28, 2020, the model cited news from July 29 about Kodak's 318% surge. That is recall of training data, not reasoning about the future.
Architectural Gap Analysis
Based on our reverse-engineering, we identify six critical gaps between PredictionArena's observable architecture and SOTA systems:
| Gap | SOTA Approach | PredictionArena |
|---|---|---|
| Multi-Source Data | 4-5 concurrent APIs (AskNews, Perplexity, Google) | Kalshi market data only |
| Research Pipeline | 6-7 step agentic workflow with structured outputs | Monolithic TEXT blob |
| Base Rate Analysis | BaseRateResearcher with historical probabilities | None visible |
| Model Ensemble | 5 models with weighted averaging | Single model per agent |
| Inside/Outside View | Explicitly separated reasoning modes | Combined/unclear |
| Calibration | Overconfidence prompting, historical adjustment | None visible |
Pragmatic Recommendations
Based on PredictionArena's existing architecture (Supabase + agent-runner + Kalshi API), here are implementation-ready improvements:
1. Structured Research Pipeline
```typescript
// Replace monolithic TEXT with structured JSONB
interface StructuredReasoning {
  outside_view: {
    reference_class: string;
    base_rate: number;
    historical_events: string[];
  };
  inside_view: {
    causal_factors: string[];
    mechanistic_analysis: string;
  };
  data_sources: string[];
  confidence_calibration: number;
  thinking_duration_ms: number;
}

// ALTER TABLE cycles ADD COLUMN research_artifacts JSONB;
```
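Persisting that artifact fits the existing stack: Supabase's Python client can write the JSONB column directly. A sketch, assuming the research_artifacts migration above has been applied; the key, cycle ID, and field values are placeholders:

```python
# Write the structured artifact into the proposed JSONB column (sketch).
from supabase import create_client

SUPABASE_URL = "https://artjdzrirbbezfjpfzyd.supabase.co"
SERVICE_ROLE_KEY = "..."  # server-side key, never the public anon key
cycle_id = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

client = create_client(SUPABASE_URL, SERVICE_ROLE_KEY)
artifact = {
    "outside_view": {"reference_class": "...", "base_rate": 0.45,
                     "historical_events": []},
    "inside_view": {"causal_factors": [], "mechanistic_analysis": "..."},
    "data_sources": ["kalshi", "asknews", "perplexity"],
    "confidence_calibration": 0.62,
    "thinking_duration_ms": 41350,
}
client.table("cycles").update({"research_artifacts": artifact}).eq("id", cycle_id).execute()
```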
2. Multi-Source Data Integration
```python
# Add to agent-runner/tools/
research_sources = {
    "perplexity": "llama-3.1-sonar-huge-128k-online",
    "asknews": "max_news + historical_articles",
    "web_search": "Google/Bing/DuckDuckGo",
}

# EMNLP 2025: This alone could improve performance by 41.7%
```
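The dictionary above is configuration only; the retrieval itself should fan out concurrently so that four or five sources cost no more wall-clock time than one. A minimal sketch with aiohttp; the endpoint URLs are deliberately fake placeholders, since we are not asserting any particular provider API:

```python
# Concurrent fan-out across research sources (placeholder endpoints).
import asyncio
import aiohttp

ENDPOINTS = {  # hypothetical URLs for illustration only
    "perplexity": "https://example.invalid/perplexity",
    "asknews": "https://example.invalid/asknews",
    "web_search": "https://example.invalid/search",
}

async def fetch_all(query: str) -> dict[str, str]:
    async with aiohttp.ClientSession() as session:
        async def one(name: str, url: str) -> tuple[str, str]:
            async with session.get(url, params={"q": query}) as resp:
                return name, await resp.text()
        pairs = await asyncio.gather(*(one(n, u) for n, u in ENDPOINTS.items()))
    return dict(pairs)

# evidence = asyncio.run(fetch_all("Will CPI exceed 3% in June 2026?"))
```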
3. Calibration Prompting
```
WARNING: You have historically been overconfident.
Base rate for Kalshi positive resolutions: ~45%
Category-specific adjustments:
  - Elections: Focus on polling aggregates
  - Market prices: Consider momentum + fundamentals
  - Disease outbreaks: Weight epidemiological models
```
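Prompting is one lever; a post-hoc adjustment is the other. A minimal sketch that shrinks a raw forecast toward the ~45% base rate quoted in the prompt above, working in log-odds space; the 0.7 shrink factor is an illustrative value that would in practice be fit from historical calibration data:

```python
# Shrink an overconfident forecast toward the category base rate in
# log-odds space. shrink=1.0 leaves p unchanged; smaller values pull harder.
import math

def calibrate(p: float, base_rate: float = 0.45, shrink: float = 0.7) -> float:
    logit = lambda x: math.log(x / (1 - x))
    base = logit(base_rate)
    adjusted = base + shrink * (logit(p) - base)
    return 1 / (1 + math.exp(-adjusted))

print(round(calibrate(0.90), 3))  # 0.90 is pulled down to about 0.814
```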
Conclusion
PredictionArena.ai asks a compelling question: “Can models predict the future?” Our analysis suggests the platform may be handicapping its agents through architectural constraints, though we acknowledge significant uncertainty about server-side capabilities we cannot observe.
The broader research picture is sobering: even with sophisticated scaffolding, professional forecasters beat every AI bot in Metaculus Q2 2025 (p = 0.00001).[1] Extended reasoning can actually worsen calibration. And 19-37% of apparent LLM “predictive power” may be memorization contamination, not genuine inference.
We offer this analysis not as a takedown, but as a technical contribution. We invite the PredictionArena team to respond, and we commit to updating this analysis based on any corrections or clarifications they provide.
References
1. Metaculus Research Team (2025). Q2 AI Benchmark Results: Pros Maintain Clear Lead. Effective Altruism Forum. [Link]. Accessed January 2026.
2. ForecastBench Team (2025). How Well Can Large Language Models Predict the Future? Forecasting Research Substack. [Link]. Accessed January 2026.
3. Chen et al. (2025). Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents. EMNLP 2025. [Link]. Accessed January 2026.
4. Zhang et al. (2025). KalshiBench: Do LLMs Know What They Don't Know? arXiv:2512.16030. [Link]. Accessed January 2026.
5. Panshul42 (2025). Forecasting Bot Q2: Metaculus Tournament Winner. GitHub repository. [Link]. Accessed January 2026.
6. Metaculus (2025). Forecasting Tools Framework. GitHub repository. [Link]. Accessed January 2026.
7. Xu et al. (2025). LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena. arXiv:2510.17638. [Link]. Accessed January 2026.