Francesc Alted - Blosc2: Fast And Flexible Handling Of N-Dimensional and Sparse Datasets

Learn about Blosc2, a fast data compression library with advanced features like double partitioning, AI-powered parameter tuning, and support for massive N-dimensional datasets.

Key takeaways
  • Blosc2 is both a C and Python library for fast data compression, with a simple format specification under 300 lines
  • Features double partitioning (chunks and blocks), enabling more selective and faster queries compared to single-partition formats
  • Supports multi-dimensional arrays stored in containers with 63-bit sizes, and can handle datasets of up to 8 trillion cells (~8 TB)
  • Includes Btune for automatic compression parameter selection using:
    • Genetic algorithms to test parameter combinations
    • Deep learning models for real-time codec/filter selection
    • Local training capabilities for custom datasets
  • Offers dynamic plugin support for extending functionality with custom codecs and filters
  • Integrates with HDF5 through the PyTables and h5py wrappers
  • Achieves 5-8x speedups when the second partition (blocks) is exploited
  • Supports JPEG 2000 lossy compression through the Grok plugin
  • Mimics the NumPy API for ease of use and familiar syntax
  • Is usable from languages beyond Python, including C++, Rust, Julia, and R, through the underlying C-Blosc2 library
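
To make the double partitioning idea from the takeaways more concrete, here is a minimal sketch in plain Python. It is not Blosc2's actual format or API: it uses zlib from the standard library as a stand-in codec, and the names compress_two_level, read_slice, CHUNK_ITEMS and BLOCK_ITEMS are hypothetical, chosen only for illustration. Values are split into chunks (first partition), each chunk is split into independently compressed blocks (second partition), and a slice read decompresses only the blocks that overlap the requested range.

```python
import struct
import zlib

# Illustrative sizes only (assumptions); CHUNK_ITEMS must be a multiple of BLOCK_ITEMS.
CHUNK_ITEMS = 1000  # items per chunk (first partition)
BLOCK_ITEMS = 100   # items per block (second partition)


def compress_two_level(values):
    """Split int64 values into chunks, and each chunk into separately compressed blocks."""
    chunks = []
    for c in range(0, len(values), CHUNK_ITEMS):
        chunk = values[c:c + CHUNK_ITEMS]
        blocks = []
        for b in range(0, len(chunk), BLOCK_ITEMS):
            block = chunk[b:b + BLOCK_ITEMS]
            raw = struct.pack(f"<{len(block)}q", *block)
            blocks.append(zlib.compress(raw))  # zlib stands in for a real Blosc2 codec
        chunks.append(blocks)
    return chunks


def read_slice(chunks, start, stop):
    """Return (values[start:stop], blocks_decompressed) for 0 <= start < stop."""
    blocks_per_chunk = CHUNK_ITEMS // BLOCK_ITEMS
    first_block = start // BLOCK_ITEMS
    last_block = (stop - 1) // BLOCK_ITEMS
    out = []
    touched = 0
    for g in range(first_block, last_block + 1):
        # Map the global block index to (chunk index, block index inside the chunk).
        ci, bi = divmod(g, blocks_per_chunk)
        raw = zlib.decompress(chunks[ci][bi])
        touched += 1
        items = struct.unpack(f"<{len(raw) // 8}q", raw)
        base = g * BLOCK_ITEMS
        lo = max(start, base) - base
        hi = min(stop, base + len(items)) - base
        out.extend(items[lo:hi])
    return out, touched


values = list(range(5000))
chunks = compress_two_level(values)
# This slice straddles a chunk boundary, yet only two small blocks are decompressed.
result, touched = read_slice(chunks, 1950, 2050)
```

With a single partition of 1000-item chunks, the same read would have to decompress two whole chunks (2000 items) to return 100 values. With the second partition it touches two 100-item blocks, which is the kind of selectivity the takeaways attribute to Blosc2's chunk/block layout.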