Josh Borrow - Making Research Data Flow with Python | SciPy 2024

Learn how Josh Borrow designed a Python-based data management system to handle petabyte-scale data transfer and storage for a telescope in Chile's Atacama Desert.

Key takeaways
  • The Librarian is a Python-based web service built to manage large-scale data transfer and storage for the Simons Observatory telescope in Chile’s Atacama Desert

  • Key technical components include:

    • FastAPI and Pydantic for the web framework
    • SQLAlchemy for database ORM
    • Schedule for background task management
    • Dependency injection for service management
    • pytest-xprocess for testing
  • Data challenges addressed:

    • Moving ~1 petabyte/year from a remote telescope site
    • Limited network bandwidth (50 Mbps radio link)
    • Need for data redundancy and integrity verification
    • Managing manual “sneakernet” hard drive transfers
  • Design principles:

    • Keep components simple and modular
    • Delegate security/permissions to Unix filesystem
    • Focus on API-first design without UI requirements
    • Use environment variables for configuration
    • Enable distributed operation across multiple sites
  • Testing recommendations:

    • Focus on integration and end-to-end tests over unit tests
    • Use pytest-xprocess to spin up test instances
    • Prepare early for external service testing (e.g. Globus)
    • Use SQLite for test databases when possible
  • The system manages data flow through:

    • File manifests for tracking content
    • Background tasks for integrity checks
    • Multiple redundant copies
    • Site-specific librarian instances
    • Both network and physical transfer methods
  • Web services for scientific applications don’t need to be complex; simple tools that solve specific problems can be very effective
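
The bandwidth figures in the data challenges above can be sanity-checked with quick arithmetic. The sketch below assumes a decimal petabyte (10^15 bytes) and a link that stays fully saturated all year, which is optimistic; the function name is illustrative and not taken from the Librarian.

```python
# Back-of-the-envelope check: can a 50 Mbps link carry ~1 PB per year?
# Assumptions (not from the talk): decimal units, 100% link utilisation.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds
PETABYTE = 10**15                      # decimal petabyte, in bytes


def link_capacity_bytes_per_year(megabits_per_second: float) -> float:
    """Bytes a link can move in one year if it is saturated the whole time."""
    bits_per_second = megabits_per_second * 1_000_000
    return bits_per_second / 8 * SECONDS_PER_YEAR


capacity = link_capacity_bytes_per_year(50)
fraction = capacity / PETABYTE
print(f"{capacity / 1e12:.0f} TB/year, {fraction:.0%} of 1 PB")
# prints: 197 TB/year, 20% of 1 PB
```

Even in this best case the radio link covers only about a fifth of the yearly volume, which is consistent with the need for physical "sneakernet" transfers alongside the network path.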
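
The file manifests and integrity checks listed above can be illustrated with a minimal checksum manifest. This is a sketch under stated assumptions, not the Librarian's actual format: the use of SHA-256, the dictionary layout, and the function names `file_checksum`, `build_manifest`, and `verify_manifest` are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose checksum changed."""
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or file_checksum(target) != expected:
            bad.append(rel)
    return bad


# Demo: record a manifest, then simulate corruption after a transfer.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "obs_001.dat").write_bytes(b"telescope data")
    manifest = build_manifest(root)
    clean = verify_manifest(root, manifest)    # nothing has changed yet
    (root / "obs_001.dat").write_bytes(b"corrupted")
    damaged = verify_manifest(root, manifest)  # checksum no longer matches
```

A periodic background job could call something like `verify_manifest` on each stored copy and flag any path it returns for re-transfer from a redundant copy.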
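
The design principle of configuring through environment variables can be sketched with the standard library alone. The variable names (`EXAMPLE_SITE_NAME` and so on), the `Settings` fields, and the defaults below are invented for illustration; the real service may well structure its settings differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical per-site settings for one service instance."""
    site_name: str
    database_url: str
    check_interval_hours: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, applying defaults."""
    return Settings(
        # Required: raises KeyError if the variable is not set.
        site_name=env["EXAMPLE_SITE_NAME"],
        # Optional: falls back to a local SQLite file, handy for tests.
        database_url=env.get("EXAMPLE_DATABASE_URL", "sqlite:///example.db"),
        check_interval_hours=int(env.get("EXAMPLE_CHECK_INTERVAL_HOURS", "24")),
    )


# Passing a plain dict instead of os.environ keeps the example testable.
settings = load_settings({"EXAMPLE_SITE_NAME": "site-a"})
```

Because each site runs its own instance, the same code can be deployed everywhere and only the environment differs between sites.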