Santiago Soler - Pooch: a friend to fetch your data files | SciPy 2024

Learn how Pooch, a Python library, downloads and caches data files from the web. It verifies downloads with checksums, supports multiple protocols, and integrates with data analysis tools.

Key takeaways
  • Pooch is a Python library for downloading and caching data files from the web while verifying file integrity through checksums

  • Key features:

    • Downloads over multiple protocols and from services such as GitHub, Zenodo, and Dataverse
    • File integrity verification
    • Caching system to avoid redundant downloads
    • Support for custom downloaders and processors
    • Chunked downloads for large files
  • Common use cases:

    • Package maintainers managing sample datasets
    • Researchers downloading scientific data
    • Teachers/tutorial creators sharing example files
    • Integration into reproducible workflows
  • Implementation approaches:

    • Basic usage with pooch.retrieve() for simple downloads
    • Pooch class for managing multiple files via registry
    • Version-specific downloads (development vs release versions)
    • Custom processors for handling archives/zip files
  • Advanced capabilities:

    • Shared caches across user groups
    • Custom download processors
    • Plugin system for community-developed downloaders
    • Registry management for file names and hashes
    • Integration with data analysis tools (pandas, xarray)
  • Future roadmap:

    • Improved logging configuration
    • Better handling of custom URLs
    • Single registry for URLs and hashes
    • Enhanced plugin system
    • JSON file format support
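
The download, cache, and verify cycle described in the takeaways can be sketched with the standard library alone. This is a conceptual illustration, not Pooch's own implementation: the names fetch_cached, file_sha256, and http_download are hypothetical, as is the choice of SHA256 and of the last URL segment as the cached file name.

```python
import hashlib
import urllib.request
from pathlib import Path

CHUNK_SIZE = 65536  # read and write in chunks so large files never sit fully in memory


def file_sha256(path):
    """Return the SHA256 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def http_download(url, dest):
    """Default downloader (hypothetical): stream an HTTP(S) response to disk."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        for chunk in iter(lambda: response.read(CHUNK_SIZE), b""):
            out.write(chunk)


def fetch_cached(url, known_hash, cache_dir, downloader=http_download):
    """Return a local path for `url`, downloading only when the cache misses.

    `downloader` is any callable taking (url, dest), which is how a custom
    downloader for another protocol or service could be plugged in.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]

    # Cache hit: the file exists and its checksum matches, so skip the download.
    if target.exists() and file_sha256(target) == known_hash:
        return target

    # Cache miss: download to a temporary name and verify before keeping it.
    partial = target.with_name(target.name + ".part")
    downloader(url, partial)
    actual = file_sha256(partial)
    if actual != known_hash:
        partial.unlink()
        raise ValueError(
            f"Hash mismatch for {url}: expected {known_hash}, got {actual}"
        )
    partial.replace(target)
    return target
```

With Pooch itself, these steps are handled by pooch.retrieve(), which takes a url and a known_hash and returns the path of the cached file. The Pooch class applies the same idea to many files at once through a registry of file names and hashes.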