Prawiroatmodjo et al. - How to bootstrap a Data Warehouse with DuckDB | SciPy 2024

Python

Learn how to build a modern data warehouse using DuckDB - from local development to cloud deployment. Discover SQL advantages, optimization features & real-world use cases.

Key takeaways

DuckDB is a lightweight in-process SQL analytics engine with clients in over 15 languages, reaching 1 million downloads per week
Key advantages of DuckDB:
- Handles larger than memory data efficiently
- Fast data processing compared to pandas
- Great interoperability with Python ecosystem (Arrow, pandas, NumPy, Polars)
- Can read multiple data formats including parquet, SQLite files
- No separate server needed - everything runs in same process
MotherDuck is a serverless DuckDB platform that:
- Enables hybrid local/cloud execution
- Optimizes queries between local and remote resources
- Minimizes data transfer over slow connections
- Allows easy data sharing and collaboration
Data warehousing workflow with DuckDB:
1. Build local data pipeline
2. Store processed data in DuckDB database
3. Share/collaborate via MotherDuck
Most real-world datasets are smaller than expected:
- Over 99% of datasets are below 5GB
- “Big data” is often not necessary for typical analytics needs
- DuckDB works well for small-to-medium data workloads
SQL advantages for data analysis:
- Declarative syntax
- Automatic query optimization
- Easy data pipeline versioning
- Better than managing multiple CSV files
- Natural integration with modern data tools like DBT
Built-in optimization features:
- Smart query planning
- Efficient remote/local hybrid execution
- Automatic caching
- Optimized parquet file reading

Prawiroatmodjo et al. - How to bootstrap a Data Warehouse with DuckDB | SciPy 2024

More talks