Using Polymarket Data for Machine Learning Models
Prediction markets offer something rare in machine learning: a continuous, real-money probability estimate with a hard ground-truth label at the end. Each binary market eventually resolves to 0 or 1, so every historical price can be scored against a known outcome. That's an unusually clean supervised-learning setup, if you have the data.
Why prediction-market data is good for ML
- Built-in labels. The resolution *is* the label. No hand-annotation, no ambiguity.
- Probabilistic targets. Prices are real-money implied probabilities, so you can train and evaluate on calibration (for example with the Brier score or log loss), not just classification accuracy.
- Rich microstructure. Order-book depth and spread carry information about conviction and liquidity that a single price hides.
- Many distinct markets. Thousands of resolved markets give you a large, varied training set across very different event types. Related markets can move together, so don't assume they are statistically independent.
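To make the calibration point concrete, here is a minimal sketch that scores prices against resolved outcomes in two ways: the Brier score and a binned reliability table. The function names and the toy arrays are made up for illustration; nothing here comes from a real market.

```python
import numpy as np

def brier_score(prices, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    prices = np.asarray(prices, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((prices - outcomes) ** 2))

def reliability_table(prices, outcomes, n_bins=5):
    """Per equal-width price bin: (mean price, observed win rate, count)."""
    prices = np.asarray(prices, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0..n_bins-1.
    idx = np.digitize(prices, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():  # skip empty bins
            rows.append((float(prices[mask].mean()),
                         float(outcomes[mask].mean()),
                         int(mask.sum())))
    return rows

# Toy data (illustrative only): four snapshot prices and their markets' outcomes.
prices = [0.1, 0.3, 0.7, 0.9]
outcomes = [0, 0, 1, 1]

score = brier_score(prices, outcomes)
table = reliability_table(prices, outcomes)
print(score)
for mean_price, win_rate, count in table:
    print(f"mean price {mean_price:.2f}  win rate {win_rate:.2f}  n={count}")
```

In a well-calibrated set of prices, the mean price and the observed win rate in each bin should be close. A lower Brier score is better.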
What features you can build
From a 15-minute snapshot series you can engineer:
- price momentum and volatility over rolling windows
- spread and book imbalance (bid depth vs ask depth)
- time-to-resolution decay features
- cross-market signals (correlated events moving together)
- the realized outcome, attached as your label (the target, not an input feature)
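Several of the features above can be sketched in a few lines of pandas. The column names (`market`, `timestamp`, `mid`, `best_bid`, `best_ask`, `bid_depth`, `ask_depth`) are assumptions about the file layout, and `scheduled_end` is a hypothetical column holding an end date known in advance. Using the actual resolution time would leak future information if it was not known at the snapshot. Cross-market signals are left out of this sketch.

```python
import pandas as pd

def add_features(df, window=4):
    """Add per-market features; each row uses only current and past snapshots."""
    df = df.sort_values(["market", "timestamp"]).copy()
    by_market = df.groupby("market")["mid"]
    # Momentum: change in mid over the last `window` snapshots
    # (one hour at a 15-minute cadence when window=4).
    df["momentum"] = by_market.diff(window)
    # Volatility: rolling standard deviation of one-step price changes.
    step = by_market.diff()
    df["volatility"] = step.groupby(df["market"]).transform(
        lambda s: s.rolling(window).std()
    )
    df["spread"] = df["best_ask"] - df["best_bid"]
    # Bid-side share of visible depth; NaN when the book is empty.
    total = df["bid_depth"] + df["ask_depth"]
    df["book_imbalance"] = (df["bid_depth"] / total).where(total > 0)
    # Time to the (assumed) scheduled end, in hours.
    remaining = df["scheduled_end"] - df["timestamp"]
    df["hours_to_end"] = remaining.dt.total_seconds() / 3600
    return df

# Toy data (illustrative only): two markets, six 15-minute snapshots each.
ts = pd.date_range("2024-01-01", periods=6, freq="15min", tz="UTC")
toy = pd.DataFrame({
    "market": ["A"] * 6 + ["B"] * 6,
    "timestamp": list(ts) * 2,
    "mid": [0.50, 0.52, 0.54, 0.56, 0.58, 0.60] + [0.30] * 6,
    "bid_depth": [100.0] * 12,
    "ask_depth": [300.0] * 12,
    "scheduled_end": pd.Timestamp("2024-01-02", tz="UTC"),
})
toy["best_bid"] = toy["mid"] - 0.01
toy["best_ask"] = toy["mid"] + 0.01

features = add_features(toy)
print(features[["market", "timestamp", "momentum", "volatility", "book_imbalance"]])
```

Grouping by market before every `diff` or `rolling` call keeps one market's prices from bleeding into the first rows of the next, which is an easy bug to introduce on a concatenated file.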
The data you need
Training needs density and history: enough snapshots, across enough markets, with order-book detail — not just a sparse price line. The Polymarket Historical Dataset is purpose-built for this: 18.5M+ rows of price + depth at a 15-minute cadence across 18,400+ markets, each row carrying mid, best bid/ask, depth, spread, and a UTC timestamp, with the market's final outcome available as your target.
It's CSV, so the path into a model is short:
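import pandas as pd

# Load the snapshots and order them chronologically within each market.
df = pd.read_csv("polymarket_history.csv", parse_dates=["timestamp"])
df = df.sort_values(["market", "timestamp"])
# Bid-side share of visible depth: 0.5 is balanced; NaN if both sides are empty.
df["book_imbalance"] = df["bid_depth"] / (df["bid_depth"] + df["ask_depth"])
# Next: join each row to its market's resolved outcome as the label, then train.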
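The label join mentioned in the last comment might look like the sketch below. It assumes the outcomes are available as a separate table with one row per market and columns named `market` and `outcome`; that layout is a guess, so adjust it to however the outcomes are actually delivered.

```python
import pandas as pd

# Toy snapshots (illustrative only). Market "C" has not resolved yet.
snapshots = pd.DataFrame({
    "market": ["A", "A", "B", "B", "C"],
    "mid": [0.60, 0.70, 0.20, 0.10, 0.50],
})
# Hypothetical outcomes table: one row per resolved market.
outcomes = pd.DataFrame({"market": ["A", "B"], "outcome": [1, 0]})

# validate="many_to_one" raises if a market appears twice in `outcomes`,
# which would otherwise silently duplicate snapshot rows.
labeled = snapshots.merge(outcomes, on="market", how="left",
                          validate="many_to_one")

# Keep only rows whose market has a known label.
train = labeled.dropna(subset=["outcome"]).astype({"outcome": int})
print(train)
```

A left join followed by an explicit `dropna` makes unresolved markets visible before they are discarded, which is handy when checking how much of the history is actually usable for training.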
Avoid the classic leakage trap
Because every market resolves, it's tempting to let information from late in a market's life leak into the features for earlier snapshots. Don't. Split by market and by time, and make sure features at time *t* use only data available at *t*. The same discipline applies whether you're training a model or backtesting a trading strategy.
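One way to split by market and by time is sketched below, under stated assumptions: train on markets whose last snapshot falls before a cutoff, test on markets whose first snapshot falls at or after it, and drop any market that straddles the cutoff. The function name, the column names, and the use of the last snapshot as a stand-in for resolution time are all illustrative choices, not the only valid scheme.

```python
import pandas as pd

def split_by_market_and_time(df, cutoff):
    """Train on markets that ended before `cutoff`; test on markets that
    started at or after it. Markets spanning the cutoff are dropped."""
    span = df.groupby("market")["timestamp"].agg(["min", "max"])
    train_markets = span.index[span["max"] < cutoff]
    test_markets = span.index[span["min"] >= cutoff]
    train = df[df["market"].isin(train_markets)]
    test = df[df["market"].isin(test_markets)]
    return train, test

# Toy data (illustrative only): market B straddles the cutoff.
toy = pd.DataFrame({
    "market": ["A", "A", "B", "B", "C", "C"],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-02",
         "2024-01-06", "2024-01-05", "2024-01-07"],
        utc=True,
    ),
})
cutoff = pd.Timestamp("2024-01-05", tz="UTC")

train, test = split_by_market_and_time(toy, cutoff)
print(sorted(train["market"].unique()), sorted(test["market"].unique()))
```

Dropping straddling markets costs some data, but it guarantees that no test market shares snapshots or a not-yet-known outcome with the training period.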
Start with clean data
The modeling is the fun part; sourcing months of aligned, gap-checked history is the grind. Skip it — grab the Polymarket Historical Dataset ($15 one-time, or a $29/mo refreshed feed for continuous training) and spend your time on features, not collectors.