DuckDB vs Pandas for Large-Scale Orderbook Analysis
By Imbalance Labs Research
Why This Comparison Matters
Our full dataset contains ~2.3 million rows across 24 instruments, with 47 columns per row. That's roughly 108 million data points. As you scale from exploratory analysis to production pipelines, the choice between Pandas and DuckDB significantly impacts both performance and developer experience.
Pandas has been the default tool for tabular data in Python for over a decade. DuckDB is a newer analytical database that runs in-process (no server needed) and can query Parquet files directly with SQL. Both have their sweet spots — here's how they compare on our orderbook data.
Setup
Loading our Parquet files with both tools:
import pandas as pd df = pd.read_parquet( 'btc_l2_depth_5m.parquet' ) print(df.shape) # (96432, 47)
import duckdb
con = duckdb.connect()
result = con.sql("""
SELECT count(*)
FROM read_parquet(
'btc_l2_depth_5m.parquet'
)
""").fetchone()
# (96432,)Query Performance Benchmarks
We benchmarked three common operations on the full 24-instrument dataset (~2.3M rows, all instruments concatenated). Tests run on an M2 MacBook Pro with 16GB RAM:
| Operation | Pandas | DuckDB | Speedup |
|---|---|---|---|
| Simple filter (bid_volume_level_1 > 100K) | 1.2s | 0.08s | 15× |
| Hourly VWAP aggregation (groupby + weighted mean) | 3.4s | 0.15s | 22× |
| Rolling 20-bar OBI z-score (window function) | 2.1s | 0.22s | 9.5× |
Memory Usage
The critical difference: Pandas loads the entire dataset into memory. DuckDB uses streaming execution — it can process files larger than your available RAM.
| Metric | Pandas | DuckDB |
|---|---|---|
| Peak memory (single instrument) | ~180 MB | ~12 MB |
| Peak memory (all 24 instruments) | ~4.2 GB | ~45 MB |
| Can query files > RAM? | ❌ | ✅ |
When to Use What
| Use Case | Recommendation |
|---|---|
| Ad-hoc exploration & filtering | DuckDB |
| Complex SQL aggregations | DuckDB |
| Custom feature engineering for ML | Pandas |
| Feeding into scikit-learn / PyTorch | Pandas |
| Large-scale data pipeline | DuckDB |
| Interactive Jupyter notebooks | Both (DuckDB → Pandas) |
Verdict
For orderbook data analysis at scale, the optimal workflow is a hybrid approach: use DuckDB for initial data loading, filtering, and aggregation (especially when working with all 24 instruments simultaneously), then convert to Pandas DataFrames for custom feature engineering and ML model training.
import duckdb
import pandas as pd
# Best of both worlds: DuckDB for heavy lifting, Pandas for ML
con = duckdb.connect()
# Filter and aggregate with DuckDB (fast, low memory)
df = con.sql("""
SELECT
timestamp_utc,
close_price,
bid_volume_level_1,
ask_volume_level_1,
bid_volume_level_1 / (bid_volume_level_1 + ask_volume_level_1) AS obi
FROM read_parquet('*.parquet')
WHERE close_price > 0
ORDER BY timestamp_utc
""").df() # .df() converts to Pandas DataFrame
# Now use Pandas for custom feature engineering
df['obi_zscore'] = (df['obi'] - df['obi'].rolling(20).mean()) / df['obi'].rolling(20).std()
df['target'] = df['close_price'].pct_change().shift(-1) > 0Try It With Real Data
Download our free 7-day sample to benchmark DuckDB vs Pandas on real Hyperliquid orderbook depth data.