Going beyond Parquet's default settings – you may be surprised what you can get

Learn how to optimize Parquet files beyond the defaults - from compression algorithms and encoding schemes to file sizes and sorting strategies - for better performance.

Key takeaways
  • Parquet’s columnar storage format enables efficient compression and is a good fit for analytical workloads
  • Dictionary encoding is the default in Parquet and works well for repeated values, providing significant space savings
  • Row group size affects performance - more, smaller row groups add overhead but enable finer-grained filtering
  • Compression algorithms involve tradeoffs (see the first sketch after this list):
    • Snappy and zstd are fast with good compression
    • Gzip is slower and rarely the optimal choice
    • Brotli offers good compression but takes longer
  • Byte stream split encoding is beneficial for floating-point data such as machine learning predictions (second sketch below)
  • Sort data before writing to Parquet for better compression ratios (third sketch below)
  • Consider file size and row group size based on use case:
    • Single large files can be problematic
    • A balance between file count and file size is needed
  • Predicate pushdown allows efficient filtering without reading entire files (third sketch below)
  • Data type optimization happens automatically - Parquet's encodings use the minimum number of bits required
  • The format is language-agnostic and widely supported across data tools/ecosystems
  • Default settings in different tools (Pandas, Polars) may vary based on their typical use cases
  • Metadata and schema information are stored in the file footer (last sketch below)
  • Compression levels can be tuned to hit the right speed vs. size tradeoff
  • Test different configurations on your specific dataset as results vary by data characteristics
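To make a few of these takeaways concrete, here is a minimal write sketch using PyArrow; the table, column names, and output path are illustrative assumptions, not taken from the article. The codec, compression level, and row group size are all explicit `write_table` parameters.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative table: a repetitive integer column and a float column.
table = pa.table({
    "user_id": [1, 2, 3] * 1000,
    "score": [0.12, 0.55, 0.98] * 1000,
})

# zstd at a moderate level is usually a good speed/size balance; gzip is
# rarely the best pick, and brotli trades extra time for extra compression.
# row_group_size caps how many rows end up in each row group.
pq.write_table(
    table,
    "scores.parquet",
    compression="zstd",
    compression_level=3,
    row_group_size=100_000,
)
```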
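Dictionary encoding is on by default, while byte stream split has to be requested per column. A sketch reusing the table above, assuming a PyArrow version where `use_dictionary` and `use_byte_stream_split` accept lists of column names:

```python
# Keep dictionary encoding only for the repetitive integer column, and switch
# the float column to BYTE_STREAM_SPLIT, which typically compresses
# floating-point data (e.g. model predictions) better.
pq.write_table(
    table,
    "scores_bss.parquet",
    compression="zstd",
    use_dictionary=["user_id"],
    use_byte_stream_split=["score"],
)
```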
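Sorting before writing and predicate pushdown on read work together: sorted data compresses better and gives row-group statistics sharper min/max bounds to prune against. Another sketch on the same illustrative table:

```python
# Sorting by a commonly filtered column clusters similar values into the same
# row groups, which helps both compression and row-group pruning.
sorted_table = table.sort_by([("user_id", "ascending")])
pq.write_table(sorted_table, "scores_sorted.parquet", compression="zstd")

# Predicate pushdown: only row groups whose footer statistics can satisfy the
# filter are decoded, so most of the file is never read.
subset = pq.read_table("scores_sorted.parquet", filters=[("user_id", "=", 2)])
print(subset.num_rows)
```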
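Finally, the footer can be inspected without reading any data pages, which is a quick way to check how row groups, encodings, and statistics turned out for a given configuration:

```python
# Read only the footer: row group layout, per-column encodings, statistics.
meta = pq.ParquetFile("scores_sorted.parquet").metadata
print(meta.num_row_groups, meta.num_rows)
column = meta.row_group(0).column(0)
print(column.compression, column.encodings, column.statistics)
```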