DuckDB
Overview
Learn how to use DuckDB, an in-process analytical database, for fast SQL analytics on financial data. DuckDB is perfect for OLAP queries on Parquet files, CSV data, and in-memory datasets without needing a separate database server.
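For instance, a few lines of Python are enough to run SQL over a Parquet file in-process, with no server to provision (the file name below is a placeholder):

```python
import duckdb  # pip install duckdb

# In-memory database: nothing to configure beyond the library itself.
con = duckdb.connect()  # or duckdb.connect("analytics.duckdb") for a persistent file

# Query a Parquet file directly; 'quotes.parquet' is a placeholder path.
rows = con.execute("""
    SELECT ticker, MIN(date) AS first_date, COUNT(*) AS n_rows
    FROM read_parquet('quotes.parquet')
    GROUP BY ticker
""").fetchall()
print(rows)
```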
What You’ll Learn
- DuckDB Fundamentals: SQL analytics in-process
- Parquet Integration: Direct querying of Parquet files
- Data Analysis: Fast aggregations and window functions
- C++ Integration: Embed DuckDB in C++ applications
- Performance Optimization: Columnar storage benefits
- File Formats: Work with CSV, JSON, Parquet seamlessly
Key Technologies
- DuckDB: In-process OLAP database
- Parquet: Columnar storage format
- SQL: ANSI SQL for analytics
- C++ API: Native DuckDB integration
- Arrow: Zero-copy data exchange
- Pandas/Polars: DataFrame interoperability
Documentation Index
- DuckDB Data Analysis - Analytical queries and data processing with DuckDB
Use Cases
Financial Analytics
- Stock Quotes Analysis: Aggregate OHLCV data (see the sketch after this list)
- Time-series Queries: Date-based filtering and grouping
- Technical Indicators: Calculate moving averages, RSI, MACD
- Portfolio Analysis: Join and analyze holdings
- Backtesting: Historical strategy simulation
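A minimal sketch of the aggregation style involved, assuming daily bars in a hypothetical quotes.parquet file with (ticker, date, open, high, low, close, volume) columns:

```python
import duckdb

con = duckdb.connect()
# Roll daily bars up to monthly OHLCV. DuckDB aggregates accept an ORDER BY
# clause, so first/last pick the bars at the month boundaries.
monthly = con.execute("""
    SELECT ticker,
           date_trunc('month', date)  AS month,
           first(open ORDER BY date)  AS open,
           MAX(high)                  AS high,
           MIN(low)                   AS low,
           last(close ORDER BY date)  AS close,
           SUM(volume)                AS volume
    FROM read_parquet('quotes.parquet')
    GROUP BY ticker, month
    ORDER BY ticker, month
""").df()
```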
Data Processing
- ETL Pipelines: Transform raw data to analytics format
- Data Validation: Check data quality and completeness
- Reporting: Generate summary statistics
- Data Export: Convert between formats (CSV ↔ Parquet), as shown below
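Each direction of the conversion is a single COPY statement; a sketch with placeholder file names:

```python
import duckdb

con = duckdb.connect()
# CSV -> Parquet: read_csv_auto infers the schema from the file.
con.execute("""
    COPY (SELECT * FROM read_csv_auto('raw_quotes.csv'))
    TO 'quotes.parquet' (FORMAT PARQUET)
""")
# Parquet -> CSV works the same way in reverse.
con.execute("""
    COPY (SELECT * FROM read_parquet('quotes.parquet'))
    TO 'quotes_export.csv' (FORMAT CSV, HEADER true)
""")
```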
S3 Integration
- Direct Queries: Query Parquet files on S3 (see the sketch after this list)
- Data Lake: Analyze large datasets in place, without loading them locally first
- Cost Savings: Reduce data transfer and storage
- Incremental Processing: Process only new partitions
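A sketch of such a direct query using the httpfs extension; the bucket name, region, and credentials below are placeholders:

```python
import duckdb

con = duckdb.connect()
# The httpfs extension adds s3:// support.
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1';")
# Placeholder credentials; wire in real ones from your configuration.
con.execute("SET s3_access_key_id = '...'; SET s3_secret_access_key = '...';")

# Only the columns and row groups the query needs are fetched from S3.
df = con.execute("""
    SELECT ticker, AVG(close) AS avg_close
    FROM read_parquet('s3://my-bucket/quotes/*.parquet')
    GROUP BY ticker
""").df()
```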
Architecture
In-Process Analytics
```
Your Application (C++/Python) → DuckDB (embedded) → Parquet/CSV Files
                                     ↓
                              SQL Query Engine
                                     ↓
                        Fast Results (milliseconds)
```
Data Pipeline
```
Raw Data (S3/Local) → DuckDB → Transformations → Output (Parquet/DB)
```
Why DuckDB?
Performance
- Columnar Storage: Optimized for analytics
- Vectorized Execution: Process data in batches
- Parallel Query Execution: Multi-threaded queries
- Zero-copy Operations: Minimal data movement
Simplicity
- No Server: Embedded in your application
- Single File: Database in one file or in-memory
- No Configuration: Works out of the box
- SQL Standard: Familiar SQL syntax
Integration
- Multiple Languages: C++, Python, R, Java, Node.js
- File Formats: Parquet, CSV, JSON native support
- Cloud Storage: S3, GCS, Azure Blob
- DataFrame Libraries: Pandas, Polars, Arrow (see the sketch below)
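For example, DuckDB's replacement scans let SQL refer to an in-scope Pandas DataFrame by its variable name; a small sketch with made-up trade data:

```python
import duckdb
import pandas as pd

# DuckDB can scan an in-scope DataFrame without copying it row by row.
trades = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT"],
    "qty":    [100, -50, 200],
    "price":  [185.0, 187.5, 410.0],
})

# 'trades' in the query below resolves to the DataFrame above.
result = duckdb.sql("""
    SELECT ticker, SUM(qty) AS net_qty, SUM(qty * price) AS notional
    FROM trades
    GROUP BY ticker
""").df()
print(result)
```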
Key Features
SQL Analytics
```sql
SELECT ticker,
       AVG(close)  AS avg_price,
       MAX(volume) AS max_volume
FROM read_parquet('s3://bucket/quotes/*.parquet')
WHERE date >= '2024-01-01'
GROUP BY ticker
ORDER BY avg_price DESC;
```
Window Functions
```sql
SELECT ticker, date, close,
       AVG(close) OVER (
           PARTITION BY ticker
           ORDER BY date
           ROWS BETWEEN 19 PRECEDING AND CURRENT ROW
       ) AS sma_20
FROM quotes;
```
Joins and Aggregations
```sql
SELECT q.ticker, f.company_name, AVG(q.close) AS avg_price
FROM quotes q
JOIN fundamentals f ON q.ticker = f.ticker
WHERE q.date >= '2024-01-01'
GROUP BY q.ticker, f.company_name;
```
Performance Comparison
| Operation | DuckDB | SQLite | Pandas |
|---|---|---|---|
| 10M row scan | 0.1s | 5s | 2s |
| Aggregation | 0.05s | 3s | 1s |
| Join | 0.2s | 10s | 5s |
| Memory usage | Low | Medium | High |
Use with Sravz
C++ Backend
- Embed DuckDB for fast analytics endpoints
- Query Parquet files from S3
- Cache results in Redis
- Serve to Angular frontend
Python Backend
- Data analysis and validation (see the sketch after this list)
- Generate statistics and reports
- ETL processing
- Integration with Pandas/Polars
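A sketch of the validation style this enables, against a hypothetical quotes.parquet file:

```python
import duckdb

con = duckdb.connect()
src = "read_parquet('quotes.parquet')"

# Null and range checks in one pass using the FILTER clause.
total, null_closes, negative_volumes = con.execute(f"""
    SELECT COUNT(*),
           COUNT(*) FILTER (WHERE close IS NULL),
           COUNT(*) FILTER (WHERE volume < 0)
    FROM {src}
""").fetchone()

# Duplicate (ticker, date) keys indicate a broken ingest.
dupes = con.execute(f"""
    SELECT ticker, date, COUNT(*) AS n
    FROM {src}
    GROUP BY ticker, date
    HAVING COUNT(*) > 1
""").fetchall()

assert null_closes == 0 and negative_volumes == 0 and not dupes
```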
Data Flow
```
S3 Parquet → DuckDB → Analytics → REST API → Frontend
                          ↓
                    Cache (Redis)
```
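A minimal sketch of the cache step, assuming a local Redis instance and the redis-py client (the key scheme and TTL are illustrative, not Sravz's actual ones):

```python
import json
import duckdb
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)
con = duckdb.connect()

def cached_query(sql: str, ttl_seconds: int = 300):
    """Serve a DuckDB result from Redis when possible; recompute otherwise."""
    key = "duckdb:" + sql  # illustrative key scheme; hash the SQL in practice
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    rows = con.execute(sql).fetchall()
    r.setex(key, ttl_seconds, json.dumps(rows, default=str))
    return rows
```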
Getting Started
1. Read DuckDB Data Analysis
2. Install DuckDB library (C++ or Python)
3. Load sample Parquet file
4. Run analytical queries
5. Integrate into your backend
Best Practices
Performance
- Use Parquet: Columnar format for fast scans
- Partition Data: Partition by date for partition pruning (see the sketch after this list)
- Indexes: Create indexes for frequent point lookups on persistent tables
- Limit Results: Use LIMIT for large datasets
- Parallel Queries: DuckDB auto-parallelizes
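A sketch of date-based partitioning with placeholder paths: write Hive-style partitions with COPY ... PARTITION_BY, then let filters on the partition columns prune whole directories from the scan.

```python
import duckdb

con = duckdb.connect()
# Write one directory per (year, month) partition.
con.execute("""
    COPY (SELECT *, year(date) AS year, month(date) AS month
          FROM read_parquet('quotes.parquet'))
    TO 'quotes_partitioned' (FORMAT PARQUET, PARTITION_BY (year, month))
""")
# Filters on the partition columns skip non-matching directories entirely.
df = con.execute("""
    SELECT * FROM read_parquet('quotes_partitioned/**/*.parquet',
                               hive_partitioning = true)
    WHERE year = 2024 AND month = 1
""").df()
```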
Data Management
- Schema Evolution: Handle schema changes gracefully
- Data Types: Use appropriate types (DATE, DECIMAL)
- Compression: Parquet handles compression
- Incremental Updates: Append-only for time-series
Integration
- Connection Pooling: Reuse connections in C++
- Error Handling: Check query results
- Memory Limits: Set max_memory to bound large queries
- Read-only Mode: Open databases read-only for production safety (see the sketch below)
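A sketch of these safety settings in the Python API (the database file and table name are placeholders):

```python
import duckdb

# Open the database file read-only on production paths.
con = duckdb.connect("analytics.duckdb", read_only=True)

# Cap memory so a single large aggregation cannot exhaust the host
# ('max_memory' is an accepted alias for this setting).
con.execute("SET memory_limit = '4GB';")

try:
    count, = con.execute("SELECT COUNT(*) FROM quotes").fetchone()
except duckdb.Error as exc:
    # Check query results rather than assuming success.
    print(f"query failed: {exc}")
    raise
```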
Advanced Features
- Persistent Database: Store results in a .duckdb file (see the sketch after this list)
- Views: Create virtual tables
- CTEs: Common Table Expressions for complex queries
- User-Defined Functions: Extend with custom logic
- Extensions: httpfs for S3, spatial for GIS
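A short sketch combining a persistent database, a view, and a CTE (file names are placeholders):

```python
import duckdb

# Tables and views persist in a single-file database.
con = duckdb.connect("analytics.duckdb")
con.execute("""
    CREATE OR REPLACE VIEW recent_quotes AS
    SELECT * FROM read_parquet('quotes.parquet')
    WHERE date >= '2024-01-01'
""")

# A CTE keeps multi-step logic readable: top tickers by average close.
top = con.execute("""
    WITH avg_prices AS (
        SELECT ticker, AVG(close) AS avg_close
        FROM recent_quotes
        GROUP BY ticker
    )
    SELECT * FROM avg_prices ORDER BY avg_close DESC LIMIT 10
""").fetchall()
```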
