Josh Borrow - Making Research Data Flow with Python | SciPy 2024
Learn how Josh Borrow designed a Python-based data management system to handle petabyte-scale data transfer and storage for a telescope in Chile's Atacama Desert.
-
The Librarian is a Python-based web service built to manage large-scale data transfer and storage for the Simons Observatory telescope in Chile’s Atacama Desert.
-
Key technical components include:
- FastAPI and Pydantic for the web framework
- SQLAlchemy for database ORM
- Schedule for background task management
- Dependency injection for service management
- pytest-xprocess for testing
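The summary does not describe how dependency injection is wired into the Librarian; in FastAPI it is usually done with `Depends`. The framework-free sketch below only illustrates the underlying idea: a service is passed into the code that uses it, so a test can swap in a fake. `StorageBackend`, `InMemoryStorage`, and `ingest_file` are hypothetical names, not part of the Librarian.

```python
from typing import Protocol


class StorageBackend(Protocol):
    """Anything that can persist a named blob (hypothetical interface)."""

    def store(self, name: str, data: bytes) -> None: ...


class InMemoryStorage:
    """Fake backend for tests: keeps files in a dict instead of on disk."""

    def __init__(self) -> None:
        self.files: dict[str, bytes] = {}

    def store(self, name: str, data: bytes) -> None:
        self.files[name] = data


def ingest_file(name: str, data: bytes, storage: StorageBackend) -> int:
    """Store a file through whichever backend was injected; return its size."""
    storage.store(name, data)
    return len(data)


# The caller decides which backend to inject; here it is the in-memory fake.
fake = InMemoryStorage()
size = ingest_file("obs_0001.h5", b"\x00" * 16, storage=fake)
```

Because `ingest_file` never constructs its own backend, the same function could be handed a disk-backed or remote implementation in production without changes.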
-
Data challenges addressed:
- Moving ~1 petabyte/year from a remote telescope
- Limited network bandwidth (50 Mbps radio link)
- Need for data redundancy and integrity verification
- Managing manual “sneakernet” hard drive transfers
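The gap between the data volume and the radio link can be put into numbers. This back-of-the-envelope sketch assumes decimal units (1 PB = 10^15 bytes, 1 Mbps = 10^6 bits/s), a 365-day year, and a link running at full capacity with no protocol overhead; those assumptions are mine, not figures from the talk.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 s, ignoring leap years
PETABYTE = 1e15                        # bytes, decimal convention (assumption)


def yearly_capacity_bytes(link_mbps: float) -> float:
    """Bytes a link could move in one year at 100% utilisation."""
    bits_per_second = link_mbps * 1e6
    return bits_per_second / 8 * SECONDS_PER_YEAR


def required_mbps(bytes_per_year: float) -> float:
    """Sustained link speed needed to move a yearly volume over the network."""
    return bytes_per_year * 8 / SECONDS_PER_YEAR / 1e6


link_capacity = yearly_capacity_bytes(50)  # roughly 1.97e14 bytes (~197 TB)
needed = required_mbps(PETABYTE)           # roughly 254 Mbps
fraction = link_capacity / PETABYTE        # roughly 0.2

print(f"50 Mbps link: {link_capacity / 1e12:.0f} TB/year "
      f"({fraction:.0%} of 1 PB); 1 PB/year needs ~{needed:.0f} Mbps sustained")
```

Under these assumptions the radio link alone could carry only about a fifth of the yearly volume, which is consistent with physical drive transfers being part of the design.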
-
Design principles:
- Keep components simple and modular
- Delegate security/permissions to the Unix filesystem
- Focus on API-first design without UI requirements
- Use environment variables for configuration
- Enable distributed operation across multiple sites
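Configuration through environment variables might look like the minimal sketch below. The variable names (`LIBRARIAN_SITE_NAME`, `LIBRARIAN_DATABASE_URL`, `LIBRARIAN_PORT`), the defaults, and the `Settings` class are hypothetical and chosen only for illustration; the summary does not list the Librarian's actual settings.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Immutable bundle of per-site configuration (hypothetical fields)."""

    site_name: str
    database_url: str
    port: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the environment, falling back to local defaults.

    Passing a plain dict as `env` makes the function easy to test.
    """
    env = os.environ if env is None else env
    return Settings(
        site_name=env.get("LIBRARIAN_SITE_NAME", "local"),
        database_url=env.get("LIBRARIAN_DATABASE_URL", "sqlite:///:memory:"),
        port=int(env.get("LIBRARIAN_PORT", "8080")),
    )


# Example: override only the port, keep every other default.
settings = load_settings({"LIBRARIAN_PORT": "9000"})
```

Each site can then run the same code with a different environment, which fits the goal of operating instances at multiple locations.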
-
Testing recommendations:
- Focus on integration and end-to-end tests over unit tests
- Use pytest-xprocess to spin up test instances
- Prepare early for external service testing (e.g. Globus)
- Use SQLite for test databases when possible
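As an illustration of the SQLite recommendation, the sketch below runs a small test against an in-memory database that is created fresh for each test. It uses the standard-library `sqlite3` module directly rather than SQLAlchemy, and the `files` table and helper functions are hypothetical, not the Librarian's schema.

```python
import sqlite3
from typing import Optional


def make_test_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (name TEXT PRIMARY KEY, checksum TEXT NOT NULL)"
    )
    return conn


def register_file(conn: sqlite3.Connection, name: str, checksum: str) -> None:
    """Insert one file record; the `with` block commits or rolls back."""
    with conn:
        conn.execute(
            "INSERT INTO files (name, checksum) VALUES (?, ?)", (name, checksum)
        )


def lookup_checksum(conn: sqlite3.Connection, name: str) -> Optional[str]:
    """Return the stored checksum, or None if the file is unknown."""
    row = conn.execute(
        "SELECT checksum FROM files WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


def test_register_and_lookup() -> None:
    conn = make_test_db()
    try:
        register_file(conn, "obs_0001.h5", "abc123")
        assert lookup_checksum(conn, "obs_0001.h5") == "abc123"
        assert lookup_checksum(conn, "missing.h5") is None
    finally:
        conn.close()
```

Since the database lives only in memory, every test starts from a clean state and nothing needs to be torn down on disk.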
-
The system manages data flow through:
- File manifests for tracking content
- Background tasks for integrity checks
- Multiple redundant copies
- Site-specific librarian instances
- Both network and physical transfer methods
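The manifest and integrity-check ideas above can be sketched with the standard library alone. This is an assumption-laden illustration, not the Librarian's implementation: the choice of SHA-256, the manifest layout (relative path mapped to hex digest), and the function names are all hypothetical.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        path.relative_to(root).as_posix(): file_checksum(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose checksum differs."""
    failures = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or file_checksum(target) != expected:
            failures.append(relative_path)
    return failures
```

A manifest built at the source site can travel with the data, whether over the network or on a shipped drive, and `verify_manifest` can be rerun at the destination or periodically as a background check.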
-
Web services for scientific applications don’t need to be complex: simple tools that solve specific problems can be very effective.