What is orderbook depth data and how is it different from OHLCV?

OHLCV (Open-High-Low-Close-Volume) candles only show price and volume at a surface level. Orderbook depth data reveals the actual bid and ask liquidity sitting at multiple price levels — showing you how much buying and selling pressure exists and how far from the market price. Our dataset provides 10 levels of depth on both sides, measured in cumulative volume and basis-point distance from mid-price. This is the raw material that institutional traders and quantitative researchers use to detect order flow, predict short-term price movements, and model realistic execution costs.
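As a rough illustration of the two measurements mentioned above, the sketch below derives cumulative volume and basis-point distance from a toy order book. The prices, sizes, and the `depth_levels` helper are invented for this example and do not describe our production pipeline.

```python
# Toy book: (price, size) pairs, best level first. All values are invented.
bids = [(99.5, 2.0), (99.0, 3.0)]
asks = [(100.5, 1.0), (101.0, 4.0)]

mid_price = (bids[0][0] + asks[0][0]) / 2  # midpoint of best bid and best ask


def depth_levels(side, mid):
    """Return (cumulative_volume, distance_bps) per level for one book side."""
    levels = []
    cumulative = 0.0
    for price, size in side:
        cumulative += size  # volume resting at this level or closer to mid
        distance_bps = abs(price - mid) * 10_000 / mid  # 1 bp = 0.01%
        levels.append((cumulative, distance_bps))
    return levels


bid_levels = depth_levels(bids, mid_price)
ask_levels = depth_levels(asks, mid_price)
print(bid_levels)
print(ask_levels)
```

An OHLCV candle for the same interval would carry none of this: it records traded prices and traded volume, not the liquidity waiting on either side.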

Which instruments and timeframe does the dataset cover?

The current release covers 24 major crypto perpetual futures sourced from the leading L1 Perpetual Decentralized Exchange: BTC, ETH, SOL, BNB, XRP, DOGE, ADA, AVAX, LINK, DOT, NEAR, SUI, OP, ARB, SEI, TIA, INJ, APT, FIL, LTC, ETC, WIF, XLM, and ATOM. All data is aggregated into 5-minute bars spanning from March 2025 to February 2026 — approximately 12 months of continuous coverage. Each instrument has ~96,000 bars with 47 derived columns per row, including OHLCV, bid/ask volumes, and bid/ask distances at 10 depth levels.

Can I use this data for machine learning and deep learning?

Absolutely. The dataset is specifically designed for ML workflows. The 47-column schema feeds directly into LSTM networks, Transformer architectures, reinforcement learning environments (like Gymnasium/Stable-Baselines3), and gradient-boosted models (XGBoost, LightGBM). Common derived features include bid-ask imbalance ratios, depth-weighted pressure scores, liquidity concentration metrics, and spread dynamics — all computable in a few lines of Python from our raw columns.
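A minimal sketch of two such derived features follows. It runs on synthetic data and assumes that volume columns follow the pattern `bid_volume_level_N` / `ask_volume_level_N` and hold cumulative volume. Only the level-1 names are confirmed elsewhere on this page, so check the schema before relying on the others.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three depth levels; real files have ten.
# Column names beyond level 1 are assumed to follow the same pattern.
rng = np.random.default_rng(0)
n_rows, n_levels = 100, 3
data = {}
for side in ("bid", "ask"):
    # Cumulative volume is non-decreasing across levels.
    sizes = rng.uniform(1.0, 10.0, size=(n_rows, n_levels)).cumsum(axis=1)
    for level in range(1, n_levels + 1):
        data[f"{side}_volume_level_{level}"] = sizes[:, level - 1]
df = pd.DataFrame(data)


def imbalance(frame, level):
    """Bid share of total resting volume at one depth level (0 to 1)."""
    bid = frame[f"bid_volume_level_{level}"]
    ask = frame[f"ask_volume_level_{level}"]
    return bid / (bid + ask)


for level in range(1, n_levels + 1):
    df[f"obi_level_{level}"] = imbalance(df, level)

# Liquidity concentration: fraction of the deepest level's cumulative
# bid volume that already sits at level 1.
df["bid_concentration"] = (
    df["bid_volume_level_1"] / df[f"bid_volume_level_{n_levels}"]
)
print(df.filter(like="obi_level").describe())
```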

Can you create custom datasets with different instruments, timeframes, or depth levels?

Yes. We can generate orderbook depth data for any cryptocurrency available on major derivatives DEXs and CEXs, at any candle interval (1-minute, 5-minute, 15-minute, 1-hour, etc.), covering any historical time period. We also support extended depth up to 30 levels. Contact us at imbalancelabs@gmail.com with your requirements for a custom quote.

Is this data legally safe to use?

Yes. Our datasets are classified as Derived Data — an Aggregated Liquidity and Orderbook Depth Index. All raw order book snapshots have been aggregated across time intervals, normalized, and transformed through statistical computations. The original tick-level data is not recoverable from our product. This classification places it outside the scope of most exchange data redistribution restrictions.

What format is the data delivered in?

All datasets are delivered as compressed CSV files (.csv.gz), which can be loaded directly by Python pandas, R, DuckDB, Apache Spark, and most data analysis tools. Each file is named by instrument (e.g., BTC_5m_depth10_derived.csv.gz). The full dataset ZIP containing all 24 instruments is approximately 300 MB.
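Loading a file with pandas might look like the sketch below. To stay runnable without the real download it first writes a tiny stand-in file. The two column names are borrowed from the SQL example further down this page, and whether the timestamp needs explicit parsing is an assumption.

```python
import os
import tempfile

import pandas as pd

# Write a tiny stand-in file so the example runs without the real dataset.
# Real files have 47 columns; only two are mimicked here.
sample = pd.DataFrame(
    {
        "timestamp_utc": pd.date_range("2025-03-01", periods=3, freq="5min"),
        "close_price": [100.0, 101.0, 100.5],
    }
)
path = os.path.join(tempfile.mkdtemp(), "BTC_5m_depth10_derived.csv.gz")
sample.to_csv(path, index=False)  # gzip compression inferred from the suffix

# read_csv also infers gzip from the suffix, so no manual unzip is needed.
df = pd.read_csv(path, parse_dates=["timestamp_utc"])
print(df.dtypes)
```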

Data Engineering · 6 min read · Nov 10, 2025

DuckDB vs Pandas for Large-Scale Orderbook Analysis

By Imbalance Labs Research

Why This Comparison Matters

Our full dataset contains ~2.3 million rows across 24 instruments, with 47 columns per row. That's roughly 108 million data points. As you scale from exploratory analysis to production pipelines, the choice between Pandas and DuckDB significantly impacts both performance and developer experience.

Pandas has been the default tool for tabular data in Python for over a decade. DuckDB is a newer analytical database that runs in-process (no server needed) and can query Parquet files directly with SQL. Both have their sweet spots — here's how they compare on our orderbook data.

Setup

Loading our Parquet files with both tools:

Pandas

import pandas as pd

# Load one instrument's 5-minute depth bars fully into memory
df = pd.read_parquet('btc_l2_depth_5m.parquet')
print(df.shape)
# (96432, 47)

DuckDB

import duckdb

# In-memory connection; the Parquet file is scanned in place, not imported
con = duckdb.connect()
result = con.sql("""
  SELECT count(*)
  FROM read_parquet('btc_l2_depth_5m.parquet')
""").fetchone()
print(result)
# (96432,)

Query Performance Benchmarks

We benchmarked three common operations on the full 24-instrument dataset (~2.3M rows, all instruments concatenated). Tests were run on an M2 MacBook Pro with 16 GB of RAM:

Operation	Pandas	DuckDB	Speedup
Simple filter (bid_volume_level_1 > 100K)	1.2s	0.08s	15×
Hourly VWAP aggregation (groupby + weighted mean)	3.4s	0.15s	22×
Rolling 20-bar OBI z-score (window function)	2.1s	0.22s	9.5×

Memory Usage

The critical difference: Pandas loads the entire dataset into memory. DuckDB streams data through the query in batches, so it can process files larger than your available RAM.

Metric	Pandas	DuckDB
Peak memory (single instrument)	~180 MB	~12 MB
Peak memory (all 24 instruments)	~4.2 GB	~45 MB
Can query files > RAM?	❌	✅
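
To check the Pandas footprint on your own machine, `DataFrame.memory_usage` reports the bytes held by a loaded frame. The sketch below uses a synthetic 47-column frame of float64 values as a stand-in. Real files may mix dtypes, and peak usage while loading can be higher than the resident size measured here.

```python
import numpy as np
import pandas as pd

n_rows, n_cols = 1_000, 47
frame = pd.DataFrame(
    np.zeros((n_rows, n_cols)),
    columns=[f"col_{i}" for i in range(n_cols)],
)

# Bytes held by the column data alone (index excluded).
data_bytes = int(frame.memory_usage(index=False, deep=True).sum())
expected_bytes = n_rows * n_cols * 8  # float64 = 8 bytes per value
print(data_bytes, expected_bytes)
```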

When to Use What

Use Case	Recommendation
Ad-hoc exploration & filtering	DuckDB
Complex SQL aggregations	DuckDB
Custom feature engineering for ML	Pandas
Feeding into scikit-learn / PyTorch	Pandas
Large-scale data pipeline	DuckDB
Interactive Jupyter notebooks	Both (DuckDB → Pandas)

Verdict

For orderbook data analysis at scale, the optimal workflow is a hybrid approach: use DuckDB for initial data loading, filtering, and aggregation (especially when working with all 24 instruments simultaneously), then convert to Pandas DataFrames for custom feature engineering and ML model training.

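import duckdb
import pandas as pd

# Best of both worlds: DuckDB for heavy lifting, Pandas for ML
con = duckdb.connect()

# Filter and aggregate with DuckDB (fast, low memory).
# filename = true adds a column naming each row's source file, which
# serves as the instrument key when the glob reads every file at once.
df = con.sql("""
  SELECT
    filename,
    timestamp_utc,
    close_price,
    bid_volume_level_1,
    ask_volume_level_1,
    bid_volume_level_1 / (bid_volume_level_1 + ask_volume_level_1) AS obi
  FROM read_parquet('*.parquet', filename = true)
  WHERE close_price > 0
  ORDER BY filename, timestamp_utc
""").df()  # .df() converts to Pandas DataFrame

# Now use Pandas for custom feature engineering.
# Group by file so rolling windows and returns never span two instruments.
by_instrument = df.groupby('filename')
df['obi_zscore'] = by_instrument['obi'].transform(
    lambda s: (s - s.rolling(20).mean()) / s.rolling(20).std()
)
# Next bar's return; the last bar of each instrument has none and is labelled False
next_return = by_instrument['close_price'].shift(-1) / df['close_price'] - 1
df['target'] = next_return > 0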

Try It With Real Data

Download our free 7-day sample to benchmark DuckDB vs Pandas on real Hyperliquid orderbook depth data.
