We can't find the internet
Attempting to reconnect
Something went wrong!
Hang in there while we get back on track
Prawiroatmodjo et al. - How to bootstrap a Data Warehouse with DuckDB | SciPy 2024
Learn how to build a modern data warehouse using DuckDB - from local development to cloud deployment. Discover SQL advantages, optimization features & real-world use cases.
-
DuckDB is a lightweight in-process SQL analytics engine with clients in over 15 languages, reaching 1 million downloads per week
-
Key advantages of DuckDB:
- Handles larger than memory data efficiently
- Fast data processing compared to pandas
- Great interoperability with Python ecosystem (Arrow, pandas, NumPy, Polars)
- Can read multiple data formats including parquet, SQLite files
- No separate server needed - everything runs in same process
-
MotherDuck is a serverless DuckDB platform that:
- Enables hybrid local/cloud execution
- Optimizes queries between local and remote resources
- Minimizes data transfer over slow connections
- Allows easy data sharing and collaboration
-
Data warehousing workflow with DuckDB:
- Build local data pipeline
- Store processed data in DuckDB database
- Share/collaborate via MotherDuck
-
Most real-world datasets are smaller than expected:
- Over 99% of datasets are below 5GB
- “Big data” is often not necessary for typical analytics needs
- DuckDB works well for small-to-medium data workloads
-
SQL advantages for data analysis:
- Declarative syntax
- Automatic query optimization
- Easy data pipeline versioning
- Better than managing multiple CSV files
- Natural integration with modern data tools like DBT
-
Built-in optimization features:
- Smart query planning
- Efficient remote/local hybrid execution
- Automatic caching
- Optimized parquet file reading