Using Polymarket Data for Machine Learning Models

polymarket, machine learning, datasets, prediction markets

Prediction markets are a rare thing in machine learning: a continuous, real-money probability estimate with a hard ground-truth label at the end. Every market eventually resolves to 0 or 1, which means every historical price has a known correct answer. That's an unusually clean supervised-learning setup — if you have the data.

Why prediction-market data is good for ML

Built-in labels. The resolution *is* the label. No hand-annotation, no ambiguity.
Calibrated targets. Prices are probability estimates backed by real money, so you can train and evaluate against calibration (for example with a Brier score or log loss), not just classification accuracy.
Rich microstructure. Order-book depth and spread carry information about conviction and liquidity that a single price hides.
Many distinct markets. Thousands of resolved markets give you a large, varied training set across very different event types, though related events can still move together.
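
To make the calibration point concrete, here is a minimal sketch of two common checks: a Brier score and a binned comparison of average price against realized hit rate. The function names and the toy numbers are illustrative assumptions, not part of any dataset.

```python
import numpy as np
import pandas as pd


def brier_score(prices, outcomes):
    """Mean squared error between the quoted probability and the 0/1 outcome."""
    prices = np.asarray(prices, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((prices - outcomes) ** 2))


def calibration_table(prices, outcomes, n_bins=5):
    """Bucket prices into equal-width bins and compare mean price to hit rate."""
    frame = pd.DataFrame({"price": prices, "outcome": outcomes})
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    frame["bin"] = pd.cut(frame["price"], bins=edges, include_lowest=True)
    return (
        frame.groupby("bin", observed=True)
        .agg(
            mean_price=("price", "mean"),
            hit_rate=("outcome", "mean"),
            n=("outcome", "size"),
        )
        .reset_index()
    )


# Toy data, made up for illustration only.
prices = [0.1, 0.2, 0.8, 0.9, 0.6, 0.4]
outcomes = [0, 0, 1, 1, 1, 0]
score = brier_score(prices, outcomes)
table = calibration_table(prices, outcomes)
```

A well-calibrated source shows `mean_price` close to `hit_rate` in every bin with enough rows; the same table works for a model's predictions in place of raw prices.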

What features you can build

From a 15-minute snapshot series you can engineer:

price momentum and volatility over rolling windows
spread and book imbalance (bid depth vs ask depth)
time-to-resolution decay features
cross-market signals (correlated events moving together)
the realized outcome as your label
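
A few of these features can be sketched in pandas as follows. This is a rough illustration under assumptions: the column names `market`, `timestamp`, `mid`, `bid_depth`, and `ask_depth` are guesses at the schema, and `resolved_at` is a hypothetical per-market resolution time that you may need to source separately. Every feature looks backward only, within a single market.

```python
import pandas as pd


def add_features(df, window=4):
    """Add backward-looking per-market features; `window` counts 15-minute snapshots."""
    df = df.sort_values(["market", "timestamp"]).reset_index(drop=True)
    mid = df.groupby("market", sort=False)["mid"]

    # Momentum: change in mid price over the last `window` snapshots.
    df["momentum"] = mid.diff(window)
    # Volatility: rolling standard deviation of one-step mid changes.
    df["volatility"] = mid.transform(lambda s: s.diff().rolling(window).std())

    # Book imbalance: bid share of total depth; NaN when the book is empty.
    total = df["bid_depth"] + df["ask_depth"]
    df["book_imbalance"] = df["bid_depth"] / total.where(total > 0)

    # Time to resolution in hours (`resolved_at` is a hypothetical column).
    remaining = df["resolved_at"] - df["timestamp"]
    df["hours_to_resolution"] = remaining.dt.total_seconds() / 3600.0
    return df


# Toy frame, made up for illustration only.
ts = pd.date_range("2024-01-01", periods=6, freq="15min", tz="UTC")
toy = pd.DataFrame({
    "market": ["a"] * 6 + ["b"] * 6,
    "timestamp": list(ts) * 2,
    "mid": [0.50, 0.52, 0.54, 0.56, 0.58, 0.60] + [0.30] * 6,
    "bid_depth": [100.0] * 12,
    "ask_depth": [300.0] * 12,
    "resolved_at": [ts[-1] + pd.Timedelta(hours=1)] * 12,
})
features = add_features(toy, window=2)
```

Cross-market signals are left out here because they depend on how you decide which markets are related; whatever pairing you choose, the same backward-looking rule applies.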

The data you need

Training needs density and history: enough snapshots, across enough markets, with order-book detail — not just a sparse price line. The Polymarket Historical Dataset is purpose-built for this: 18.5M+ rows of price + depth at a 15-minute cadence across 18,400+ markets, each row carrying mid, best bid/ask, depth, spread, and a UTC timestamp, with the market's final outcome available as your target.

It's CSV, so the path into a model is short:


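import pandas as pd

# Load the snapshots and parse the UTC timestamp column as datetimes.
df = pd.read_csv("polymarket_history.csv", parse_dates=["timestamp"])
# Order rows chronologically within each market so rolling features line up.
df = df.sort_values(["market", "timestamp"])
# Bid share of total depth: 0.5 is balanced, NaN when both sides are empty.
df["book_imbalance"] = df["bid_depth"] / (df["bid_depth"] + df["ask_depth"])
# Next: join each row to its market's resolved outcome as the label, then train.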
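
The label join mentioned in the last comment could look like the sketch below. It assumes the outcomes sit in a separate table with one row per market; the table layout and the `outcome` column name are hypothetical, so adapt them to however the outcome is actually delivered.

```python
import pandas as pd


def attach_labels(snapshots, outcomes):
    """Left-join each snapshot row to its market's resolved outcome (0 or 1)."""
    labeled = snapshots.merge(
        outcomes[["market", "outcome"]],
        on="market",
        how="left",
        validate="many_to_one",  # raises if a market has more than one outcome row
    )
    # Drop rows for markets with no recorded outcome rather than guessing a label.
    return labeled.dropna(subset=["outcome"]).astype({"outcome": "int8"})


# Toy tables, made up for illustration only.
snapshots = pd.DataFrame(
    {"market": ["a", "a", "b", "c"], "mid": [0.4, 0.6, 0.2, 0.5]}
)
outcomes = pd.DataFrame({"market": ["a", "b"], "outcome": [1, 0]})
labeled = attach_labels(snapshots, outcomes)
```

The `validate="many_to_one"` argument is a cheap guard: a duplicated market in the outcomes table would otherwise silently duplicate training rows.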

Avoid the classic leakage trap

Because every market resolves, it's tempting to let information from late in a market's life inform predictions at earlier timestamps. Don't. Split by market and by time, so that no market appears in both train and test and the test period comes after the training period, and make sure features at time *t* only use data available at *t*. The same discipline applies whether you're training a model or backtesting a trading strategy.
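
One way to enforce both splits at once is sketched below. It is a deliberately strict version under assumptions: a market's last snapshot stands in for its end date, and the `market` and `timestamp` column names are guesses at the schema. Markets that straddle the cutoff are dropped rather than cut in half.

```python
import pandas as pd


def split_by_market_and_time(df, cutoff):
    """Train on markets that ended before `cutoff`; test on markets that began at or after it."""
    spans = df.groupby("market")["timestamp"].agg(start="min", end="max")
    train_markets = spans.index[spans["end"] < cutoff]
    test_markets = spans.index[spans["start"] >= cutoff]
    # Markets active across the cutoff land in neither set.
    train = df[df["market"].isin(train_markets)]
    test = df[df["market"].isin(test_markets)]
    return train, test


# Toy frame, made up for illustration only: market "b" straddles the cutoff.
toy = pd.DataFrame({
    "market": ["a", "a", "b", "b", "c", "c"],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-02",
         "2024-01-04", "2024-01-05", "2024-01-06"],
        utc=True,
    ),
})
train, test = split_by_market_and_time(toy, pd.Timestamp("2024-01-03", tz="UTC"))
```

Dropping straddling markets costs some data but removes any overlap in time between train and test. On the feature side, the matching rule is to use only backward-looking operations such as `diff` and `rolling` within each market, and never a negative `shift`.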

Start with clean data

The modeling is the fun part; sourcing months of aligned, gap-checked history is the grind. Skip it — grab the Polymarket Historical Dataset ($15 one-time, or a $29/mo refreshed feed for continuous training) and spend your time on features, not collectors.