Pavithra Eswaramoorthy & Jaime Rodríguez-Guerra - Ensuring Runtime Reproducibility in Python

Learn key strategies for Python runtime reproducibility, from environment management to risk mitigation. Explore tools like Conda, Docker & best practices for reliable code execution.

Key takeaways
  • Reproducibility requires proactive planning and cannot be an afterthought; it needs to be modeled early in the development process

  • The runtime reproducibility framework consists of four key steps:

    • Define objectives and scope
    • Enumerate components
    • Evaluate threats/risks
    • Apply mitigation measures
  • Key considerations for reproducibility:

    • Explicit source of packages and Python interpreters
    • Platform and OS specifications
    • Hardware requirements
    • Dependency versions
    • Data storage locations
    • Infrastructure components
  • Best practices include:

    • Using version control
    • Creating environment files (environment.yml)
    • Generating dependency logs
    • Building in redundancy for critical components
    • Using internal mirrors when needed
    • Restricting channels/versions as appropriate
  • Tools and approaches:

    • Conda/Mamba for environment management
    • Docker containers for isolation
    • Virtual machines for full system reproducibility
    • conda-store for simplified environment management
    • Watermark for tracking runtime details
  • Workflows must enable and encourage reproducibility; if the process is too complex, users will default to less reproducible patterns

  • Different levels of reproducibility exist; teams need to consciously decide which level is appropriate for their needs and accept the associated risks

  • Reproducibility in data science is complicated by:

    • Fast-moving ecosystem
    • Multiple packaging systems
    • Non-pure Python dependencies
    • Hardware/OS variations
  • Being explicit about limitations and requirements helps manage expectations:

    • Supported operating systems
    • Tested configurations
    • Required dependencies
    • License considerations
  • Documentation should include complete installation steps, execution procedures, and all runtime details needed for reproduction
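
The takeaways about generating dependency logs and recording runtime details can be illustrated with a small standard-library sketch. It is not code from the talk: the function names `capture_runtime_details` and `diff_packages` and the layout of the log are assumptions made for illustration. It only sees packages visible to the running interpreter, so it complements an environment file rather than replacing one.

```python
import json
import platform
import sys
from importlib import metadata


def capture_runtime_details():
    """Collect interpreter, OS, hardware, and installed-package details.

    Hypothetical helper: the keys of the returned dict are an
    illustrative layout, not a standard format.
    """
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip distributions with missing or broken metadata
            packages[name.lower()] = dist.version
    return {
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "python_executable": sys.executable,
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


def diff_packages(recorded, current):
    """Return {name: (recorded_version, current_version)} for each mismatch.

    A package missing on one side is reported with None for that side.
    """
    names = set(recorded) | set(current)
    return {
        name: (recorded.get(name), current.get(name))
        for name in sorted(names)
        if recorded.get(name) != current.get(name)
    }


if __name__ == "__main__":
    # Print the log as JSON so it can be committed next to the code.
    print(json.dumps(capture_runtime_details(), indent=2))
```

One possible use is to save the JSON output alongside the results of a run, then call `diff_packages(saved["packages"], capture_runtime_details()["packages"])` before re-running to see which versions have drifted.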