DuckDB

Overview

Learn how to use DuckDB, an in-process analytical database, for fast SQL analytics on financial data. DuckDB is well suited to OLAP queries over Parquet files, CSV data, and in-memory datasets, with no separate database server to install or manage.

What You’ll Learn

  • DuckDB Fundamentals: SQL analytics in-process
  • Parquet Integration: Direct querying of Parquet files
  • Data Analysis: Fast aggregations and window functions
  • C++ Integration: Embed DuckDB in C++ applications
  • Performance Optimization: Columnar storage benefits
  • File Formats: Work with CSV, JSON, Parquet seamlessly

Key Technologies

  • DuckDB: In-process OLAP database
  • Parquet: Columnar storage format
  • SQL: ANSI SQL for analytics
  • C++ API: Native DuckDB integration
  • Arrow: Zero-copy data exchange
  • Pandas/Polars: DataFrame interoperability

Documentation Index

  1. DuckDB Data Analysis - Analytical queries and data processing with DuckDB

Use Cases

Financial Analytics

  • Stock Quotes Analysis: Aggregate OHLCV data (see the sketch after this list)
  • Time-series Queries: Date-based filtering and grouping
  • Technical Indicators: Calculate moving averages, RSI, MACD
  • Portfolio Analysis: Join and analyze holdings
  • Backtesting: Historical strategy simulation
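
For example, intraday trades roll up into daily OHLCV bars in a single aggregation. A minimal sketch, assuming a trades table with ticker, ts (timestamp), price, and size columns (all illustrative names):

-- daily OHLCV bars from intraday trades
SELECT ticker,
       CAST(ts AS DATE)   AS trade_date,
       arg_min(price, ts) AS open,   -- price at the earliest timestamp of the day
       max(price)         AS high,
       min(price)         AS low,
       arg_max(price, ts) AS close,  -- price at the latest timestamp of the day
       sum(size)          AS volume
FROM trades
GROUP BY ticker, trade_date
ORDER BY ticker, trade_date;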

Data Processing

  • ETL Pipelines: Transform raw data to analytics format
  • Data Validation: Check data quality and completeness
  • Reporting: Generate summary statistics
  • Data Export: Convert between formats (CSV ↔ Parquet); see the example below
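
Format conversion is a single COPY statement in DuckDB. A minimal sketch (file names are placeholders):

-- CSV to Parquet, with column types inferred from the CSV
COPY (SELECT * FROM read_csv_auto('quotes.csv')) TO 'quotes.parquet' (FORMAT PARQUET);

-- Parquet back to CSV
COPY (SELECT * FROM read_parquet('quotes.parquet')) TO 'quotes.csv' (FORMAT CSV, HEADER);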

S3 Integration

  • Direct Queries: Query Parquet files on S3 (example after this list)
  • Data Lake: Analyze large datasets without loading
  • Cost Savings: Reduce data transfer and storage
  • Incremental Processing: Process only new partitions
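
A minimal sketch of a direct S3 query using the httpfs extension (region, credentials, and bucket path are placeholders):

INSTALL httpfs;  -- one-time download of the S3/HTTP extension
LOAD httpfs;
SET s3_region = 'us-east-1';
SET s3_access_key_id = '...';      -- or pick credentials up from the environment
SET s3_secret_access_key = '...';

SELECT ticker, count(*) AS row_count
FROM read_parquet('s3://bucket/quotes/*.parquet')
GROUP BY ticker;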

Architecture

In-Process Analytics

Your Application (C++/Python) → DuckDB (embedded) → Parquet/CSV Files
                                       ↓
                                  SQL Query Engine
                                       ↓
                                 Fast Results (milliseconds)

Data Pipeline

Raw Data (S3/Local) → DuckDB → Transformations → Output (Parquet/DB)
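
One step of such a pipeline can be a single COPY: read raw data from S3, transform it, and write analytics-ready Parquet. A sketch with placeholder paths and columns, assuming the httpfs extension is loaded (see S3 Integration above):

COPY (
    SELECT ticker, CAST(ts AS DATE) AS date, avg(price) AS avg_price
    FROM read_parquet('s3://bucket/raw/*.parquet')
    GROUP BY ticker, CAST(ts AS DATE)
) TO 'daily_summary.parquet' (FORMAT PARQUET);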

Why DuckDB?

Performance

  • Columnar Storage: Optimized for analytics
  • Vectorized Execution: Process data in batches
  • Parallel Query Execution: Multi-threaded queries
  • Zero-copy Operations: Minimal data movement

Simplicity

  • No Server: Embedded in your application
  • Single File: Database in one file or in-memory
  • No Configuration: Works out of the box
  • SQL Standard: Familiar SQL syntax

Integration

  • Multiple Languages: C++, Python, R, Java, Node.js
  • File Formats: Parquet, CSV, JSON native support
  • Cloud Storage: S3, GCS, Azure Blob
  • DataFrame Libraries: Pandas, Polars, Arrow

Key Features

SQL Analytics

-- aggregate quotes straight from Parquet on S3 (needs the httpfs extension loaded)
SELECT ticker,
       AVG(close) as avg_price,
       MAX(volume) as max_volume
FROM read_parquet('s3://bucket/quotes/*.parquet')
WHERE date >= '2024-01-01'
GROUP BY ticker
ORDER BY avg_price DESC;

Window Functions

-- 20-row simple moving average per ticker
SELECT ticker, date, close,
       AVG(close) OVER (PARTITION BY ticker 
                        ORDER BY date 
                        ROWS BETWEEN 19 PRECEDING AND CURRENT ROW) as sma_20
FROM quotes;
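
The same window machinery extends to other indicators. A rough sketch of a 14-period RSI using the simple-average variant (production RSI usually applies Wilder's exponential smoothing), reusing the quotes table above:

WITH deltas AS (
    SELECT ticker, date, close,
           close - lag(close) OVER (PARTITION BY ticker ORDER BY date) AS delta
    FROM quotes
),
averages AS (
    SELECT ticker, date,
           avg(CASE WHEN delta > 0 THEN delta  ELSE 0 END) OVER w AS avg_gain,
           avg(CASE WHEN delta < 0 THEN -delta ELSE 0 END) OVER w AS avg_loss
    FROM deltas
    WINDOW w AS (PARTITION BY ticker ORDER BY date
                 ROWS BETWEEN 13 PRECEDING AND CURRENT ROW)
)
SELECT ticker, date,
       100 - 100 / (1 + avg_gain / nullif(avg_loss, 0)) AS rsi_14
FROM averages;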

Joins and Aggregations

-- enrich quotes with reference data, then aggregate
SELECT q.ticker, f.company_name, AVG(q.close) as avg_price
FROM quotes q
JOIN fundamentals f ON q.ticker = f.ticker
WHERE q.date >= '2024-01-01'
GROUP BY q.ticker, f.company_name;

Performance Comparison

Operation       DuckDB   SQLite   Pandas
10M row scan    0.1s     5s       2s
Aggregation     0.05s    3s       1s
Join            0.2s     10s      5s
Memory usage    Low      Medium   High

Use with Sravz

C++ Backend

  • Embed DuckDB for fast analytics endpoints
  • Query Parquet files from S3
  • Cache results in Redis
  • Serve to Angular frontend

Python Backend

  • Data analysis and validation
  • Generate statistics and reports
  • ETL processing
  • Integration with Pandas/Polars

Data Flow

S3 Parquet → DuckDB → Analytics → REST API → Frontend
                ↓
             Cache (Redis)

Getting Started

  1. Read DuckDB Data Analysis
  2. Install the DuckDB library (C++ or Python)
  3. Load a sample Parquet file
  4. Run analytical queries (see the smoke test after this list)
  5. Integrate into your backend
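
As a smoke test for steps 3–4, something like the following works from the DuckDB CLI or any client (file name is a placeholder):

-- inspect the inferred schema, then run a first aggregate
DESCRIBE SELECT * FROM read_parquet('sample_quotes.parquet');

SELECT ticker, min(date) AS first_day, max(date) AS last_day, count(*) AS row_count
FROM read_parquet('sample_quotes.parquet')
GROUP BY ticker;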

Best Practices

Performance

  • Use Parquet: Columnar format for fast scans
  • Partition Data: Partition by date for pruning (sketch after this list)
  • Indexes: Create indexes for frequent lookups
  • Limit Results: Use LIMIT for large datasets
  • Parallel Queries: DuckDB auto-parallelizes
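
A sketch of date partitioning and the pruned read that benefits from it (paths are placeholders):

-- write Hive-style year-partitioned Parquet
COPY (SELECT *, year(date) AS year FROM quotes)
TO 'quotes_by_year' (FORMAT PARQUET, PARTITION_BY (year));

-- a filter on the partition column skips whole directories
SELECT *
FROM read_parquet('quotes_by_year/*/*.parquet', hive_partitioning = true)
WHERE year = 2024;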

Data Management

  • Schema Evolution: Handle schema changes gracefully
  • Data Types: Use appropriate types (DATE, DECIMAL); sketch after this list
  • Compression: Parquet handles compression
  • Incremental Updates: Append-only for time-series
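
A sketch of a typed time-series table with append-only updates (the column set is illustrative):

CREATE TABLE quotes (
    ticker  VARCHAR,
    date    DATE,           -- DATE, not VARCHAR: enables date math and pruning
    open    DECIMAL(18, 4), -- exact decimals for prices
    high    DECIMAL(18, 4),
    low     DECIMAL(18, 4),
    close   DECIMAL(18, 4),
    volume  BIGINT
);

-- append-only: add the new day instead of rewriting history
INSERT INTO quotes SELECT * FROM read_parquet('new_day.parquet');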

Integration

  • Connection Pooling: Reuse connections in C++
  • Error Handling: Check query results
  • Memory Limits: Set max_memory (an alias of memory_limit) for large queries; sketch after this list
  • Read-only Mode: Use for production safety
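
Resource limits are one SET away; read-only mode is chosen when the database is opened. A sketch:

SET memory_limit = '4GB';  -- cap memory for large queries
SET threads = 4;           -- cap parallelism on shared hosts

-- read-only mode is an open-time flag, e.g. from the CLI:
--   duckdb -readonly analytics.duckdb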

Advanced Features

  • Persistent Database: Store results in a .duckdb file (sketch after this list)
  • Views: Create virtual tables
  • CTEs: Common Table Expressions for complex queries
  • User-Defined Functions: Extend with custom logic
  • Extensions: httpfs for S3, spatial for GIS
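
A sketch combining a persistent database file with a view (names are illustrative):

-- attach a persistent database next to the default in-memory one
ATTACH 'analytics.duckdb' AS analytics;

CREATE TABLE analytics.daily_summary AS
SELECT ticker, date, close
FROM quotes;

-- a view is recomputed on each read rather than materialized
CREATE VIEW analytics.latest_close AS
SELECT ticker, arg_max(close, date) AS latest_close
FROM analytics.daily_summary
GROUP BY ticker;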

Related: C++ | Python | Analytics