Data Engineering·6 min read·Nov 10, 2025

DuckDB vs Pandas for Large-Scale Orderbook Analysis

By Imbalance Labs Research

Why This Comparison Matters

Our full dataset contains ~2.3 million rows across 24 instruments, with 47 columns per row. That's roughly 108 million data points. As you scale from exploratory analysis to production pipelines, the choice between Pandas and DuckDB significantly impacts both performance and developer experience.

Pandas has been the default tool for tabular data in Python for over a decade. DuckDB is a newer analytical database that runs in-process (no server needed) and can query Parquet files directly with SQL. Both have their sweet spots — here's how they compare on our orderbook data.

Setup

Loading our Parquet files with both tools:

Pandas
import pandas as pd

df = pd.read_parquet(
  'btc_l2_depth_5m.parquet'
)
print(df.shape)
# (96432, 47)
DuckDB
import duckdb

con = duckdb.connect()
result = con.sql("""
  SELECT count(*)
  FROM read_parquet(
    'btc_l2_depth_5m.parquet'
  )
""").fetchone()
# (96432,)

Query Performance Benchmarks

We benchmarked three common operations on the full 24-instrument dataset (~2.3M rows, all instruments concatenated). Tests run on an M2 MacBook Pro with 16GB RAM:

OperationPandasDuckDBSpeedup
Simple filter (bid_volume_level_1 > 100K)1.2s0.08s15×
Hourly VWAP aggregation (groupby + weighted mean)3.4s0.15s22×
Rolling 20-bar OBI z-score (window function)2.1s0.22s9.5×

Memory Usage

The critical difference: Pandas loads the entire dataset into memory. DuckDB uses streaming execution — it can process files larger than your available RAM.

MetricPandasDuckDB
Peak memory (single instrument)~180 MB~12 MB
Peak memory (all 24 instruments)~4.2 GB~45 MB
Can query files > RAM?

When to Use What

Use CaseRecommendation
Ad-hoc exploration & filteringDuckDB
Complex SQL aggregationsDuckDB
Custom feature engineering for MLPandas
Feeding into scikit-learn / PyTorchPandas
Large-scale data pipelineDuckDB
Interactive Jupyter notebooksBoth (DuckDB → Pandas)

Verdict

For orderbook data analysis at scale, the optimal workflow is a hybrid approach: use DuckDB for initial data loading, filtering, and aggregation (especially when working with all 24 instruments simultaneously), then convert to Pandas DataFrames for custom feature engineering and ML model training.

import duckdb
import pandas as pd

# Best of both worlds: DuckDB for heavy lifting, Pandas for ML
con = duckdb.connect()

# Filter and aggregate with DuckDB (fast, low memory)
df = con.sql("""
  SELECT
    timestamp_utc,
    close_price,
    bid_volume_level_1,
    ask_volume_level_1,
    bid_volume_level_1 / (bid_volume_level_1 + ask_volume_level_1) AS obi
  FROM read_parquet('*.parquet')
  WHERE close_price > 0
  ORDER BY timestamp_utc
""").df()  # .df() converts to Pandas DataFrame

# Now use Pandas for custom feature engineering
df['obi_zscore'] = (df['obi'] - df['obi'].rolling(20).mean()) / df['obi'].rolling(20).std()
df['target'] = df['close_price'].pct_change().shift(-1) > 0

Try It With Real Data

Download our free 7-day sample to benchmark DuckDB vs Pandas on real Hyperliquid orderbook depth data.