Awowale et al. - Simplifying analysis of hierarchical HDF5 and NetCDF4 files with xarray-datatree

Learn how DataTree simplifies working with hierarchical HDF5/NetCDF4 files, making NASA's petabyte-scale Earth observation data more accessible through efficient tree-based structures.

Key takeaways
  • DataTree provides a simplified way to work with hierarchical HDF5 and NetCDF4 files by representing groups as a tree structure, avoiding the need to open multiple datasets separately

  • NASA has over 100 petabytes of Earth observation data stored in HDF format, with expected growth to 600 petabytes from new missions. Managing and accessing this data efficiently is a key challenge

  • Current tools like xarray and netCDF4-python require each group to be specified individually when opening data, which is inefficient and leads to complex code. DataTree allows viewing the entire dataset hierarchy at once

  • DataTree integrates with xarray to provide lazy loading, efficient computations, and familiar xarray operations while maintaining the hierarchical structure of the data

  • The tool enables cloud-optimized access to HDF data by reducing unnecessary file operations and memory usage compared to traditional methods

  • DataTree simplifies subsetting operations by eliminating the need to flatten/unflatten data structures and copy datasets multiple times

  • The project represents a collaboration between NASA and the open source community, with NASA engineers contributing to improve accessibility of Earth science data

  • DataTree will be integrated into the main xarray package in an upcoming release, providing long-term support and standardization

  • The tool supports various data formats beyond just HDF5/NetCDF4 and can be extended to work with other hierarchical data formats

  • Current implementations show significant performance improvements, with operations being up to 1000x faster compared to naive implementations when working with nested groups