Fabian Höring - Building reproducible distributed applications at scale

Python

Learn to build reproducible distributed applications at scale with Python virtual environments, conda, and Apache YARN.

Key takeaways

Use Python virtual environments to create reproducible applications at scale
Conda and pip can be used together, but only under strict rules (for example, installing conda packages first and pip packages last)
Use a scheduler like Apache YARN to manage distributed computing
Use pickle to serialize Python functions, but beware of issues with non-deterministic behavior
cloudpickle is a library that extends pickle to serialize Python objects the standard module cannot handle, such as lambdas and interactively defined functions
Spark can execute Python tasks on its workers, but jobs can break when installed packages or versions differ between machines
Use conda to create virtual environments and then upload them to distributed storage
PEX (Python EXecutable) is a tool that packages an application and its dependencies into a single executable file, which helps manage package environments and deployment
Use a “build-as-you-go” approach to avoid dependency issues and improve reproducibility
Docker can be used to create consistent environments, but requires careful management of dependencies
TensorFlow and Dask are examples of tools that can be used in distributed computing
The Python pickle module is used to serialize Python objects, but can have issues with non-deterministic behavior
Using Python virtual environments and package managers like conda can help improve reproducibility and ease deployment of distributed applications.
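
The serialization takeaways above can be illustrated with a minimal sketch that uses only the standard library. It assumes nothing about the code shown in the talk: the helper name try_pickle is hypothetical, and math.sqrt merely stands in for any importable function. The sketch shows that the standard pickle module stores a function by reference (module name plus attribute name) and refuses a lambda, which is the gap that cloudpickle is designed to fill.

```python
import math
import pickle


def try_pickle(obj):
    """Return the pickled bytes, or None if the standard pickle module refuses obj.

    Hypothetical helper for illustration only.
    """
    try:
        return pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError):
        return None


# An importable function is stored by reference: the payload carries the
# module name and the attribute name, not the function's code.
payload = try_pickle(math.sqrt)
stored_by_name = b"math" in payload and b"sqrt" in payload

# Loading looks the name up again, so the receiving side needs the same module.
restored = pickle.loads(payload)

# A lambda has no importable name, so the standard pickle module cannot store it.
lambda_payload = try_pickle(lambda x: x + 1)
```

Because a by-reference payload is resolved again on the machine that loads it, every worker needs the same packages installed, which is one reason to ship an identical environment to all of them.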
