Fabian Höring - Building reproducible distributed applications at scale

Learn to build reproducible distributed applications at scale with Python virtual environments, conda, and Apache YARN.

Key takeaways
  • Use Python virtual environments to create reproducible applications at scale
  • Conda and pip can be used together, but require specific rules to avoid issues
  • Use a scheduler like Apache YARN to manage distributed computing
  • Use pickle to serialize Python functions, but beware of issues with non-deterministic behavior
  • cloudpickle is a library that extends pickle to serialize Python objects the standard module cannot handle, such as lambdas and interactively defined functions
  • Spark can execute Python tasks, but runs into problems when packages are inconsistent or versions differ across machines
  • Use conda to create virtual environments and then upload them to distributed storage
  • PEX (Python EXecutable) is a tool that packages a Python environment into a single executable file, which helps with managing and deploying package environments
  • Use a “build-as-you-go” approach to avoid dependency issues and improve reproducibility
  • Docker can be used to create consistent environments, but requires careful management of dependencies
  • TensorFlow and Dask are examples of tools that can be used in distributed computing
  • The Python pickle module is used to serialize Python objects, but can have issues with non-deterministic behavior
  • Using Python virtual environments and package managers like conda can help improve reproducibility and ease deployment of distributed applications
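
To illustrate the pickle takeaways above, here is a minimal sketch using only the standard library. It shows that pickle stores an importable function as a reference (module name plus function name) rather than as code, which is why the receiving machine needs the same packages installed, and why a lambda cannot be pickled at all. The helper name pickled_by_reference is hypothetical and chosen for this example only; cloudpickle, which the talk names as the workaround, is not imported here.

```python
import json
import pickle


def pickled_by_reference(func):
    """Return True if pickle can serialize func and loading it gives back the same object."""
    try:
        payload = pickle.dumps(func)
    except (pickle.PicklingError, AttributeError):
        # Raised when pickle cannot look the function up by name in its module.
        return False
    # Loading re-imports the module and looks the name up again,
    # so the restored object is the original function itself.
    return pickle.loads(payload) is func


# An importable function is stored as "module + name", not as bytecode.
payload = pickle.dumps(json.dumps)
print(b"json" in payload, b"dumps" in payload)

# A named, importable function round-trips; a lambda has no importable name.
print(pickled_by_reference(json.dumps))
print(pickled_by_reference(lambda x: x + 1))
```

Because only the name travels over the wire, a worker with a different version of the module will silently run different code, which is one reason to ship the whole environment alongside the job.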