Exploring Zarr: From Fundamentals to Version 3.0 and Beyond [PyCon DE & PyData Berlin 2024]

Learn about Zarr, a format for large array storage: its fundamentals, key features, and improvements in v3. Explore cloud-native capabilities, compression, and migration paths.

Key takeaways
  • Zarr is a format for storing large arrays divided into compressed chunks, popular in genomics, geospatial, bioimaging and other scientific domains

  • Key features:

    • Cloud-native storage support
    • Hierarchical organization of arrays into groups
    • Compressed chunked storage
    • Support for massive datasets (petabyte scale)
    • Language-agnostic specification
  • Zarr v3 improvements over v2:

    • Metadata for each array or group consolidated into a single JSON document (zarr.json)
    • More language-agnostic specification
    • Better cloud storage optimization
    • Extension mechanism for adding features
    • Support for variable chunk sizes, via the extension mechanism
    • Sharding codec to reduce latency
  • Implementation ecosystem:

    • Python reference implementation
    • Implementations in C++, Rust, Julia, JavaScript, Java
    • Growing community and adoption
    • Regular community meetings and governance process
  • Key concepts:

    • Arrays divided into equal-sized chunks
    • Each chunk is independently compressed
    • Metadata stored in zarr.json files (v3; v2 used .zarray, .zgroup and .zattrs)
    • Dictionary-like key-value storage model
    • Support for hierarchical organization
  • Extension mechanism in v3:

    • Allows adding features without changing core spec
    • Example: the sharding codec groups multiple chunks into a single storage object
    • Community-driven proposal process (ZEPs)
    • Maintains backwards compatibility
  • Migration path:

    • v2 datasets can still be used
    • Tools being developed for v2 to v3 conversion
    • No requirement to immediately convert existing data
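
The key concepts above (equal-sized chunks, independent compression, a JSON metadata document, and a dictionary-like key-value store) can be illustrated with a minimal sketch. This is a toy model using only the Python standard library, not the Zarr library or a spec-conformant layout: the helper names `write_array` and `read_item`, the metadata fields, and the use of zlib are assumptions made for illustration. The `c/<index>` chunk keys loosely mirror the style of v3 chunk keys.

```python
import json
import struct
import zlib


def write_array(store, values, chunk_len):
    """Split a 1-D list of ints into equal-sized chunks and store each one compressed."""
    # Toy metadata document (hypothetical fields); real zarr.json files carry more.
    store["zarr.json"] = json.dumps(
        {"shape": [len(values)], "chunk_shape": [chunk_len], "dtype": "int64"}
    ).encode()
    for start in range(0, len(values), chunk_len):
        chunk = values[start:start + chunk_len]
        # Pad the final chunk with zeros so every chunk has the same length.
        chunk = chunk + [0] * (chunk_len - len(chunk))
        raw = struct.pack(f"<{chunk_len}q", *chunk)
        # Each chunk is compressed on its own and stored under its own key.
        store[f"c/{start // chunk_len}"] = zlib.compress(raw)


def read_item(store, index):
    """Read one element by fetching and decompressing only the chunk that holds it."""
    meta = json.loads(store["zarr.json"])
    chunk_len = meta["chunk_shape"][0]
    if not 0 <= index < meta["shape"][0]:
        raise IndexError(index)
    raw = zlib.decompress(store[f"c/{index // chunk_len}"])
    return struct.unpack(f"<{chunk_len}q", raw)[index % chunk_len]


# A plain dict stands in for the key-value store (a directory or object store in practice).
store = {}
write_array(store, list(range(10)), chunk_len=4)
print(sorted(store))        # ['c/0', 'c/1', 'c/2', 'zarr.json']
print(read_item(store, 9))  # 9
```

Because the store is only a mapping from keys to bytes, the same logic would apply if the dict were swapped for a local directory or a cloud object store, and reading a single element touches just one compressed chunk rather than the whole array.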