From built-in concurrency primitives to large scale distributed computing — Jakub Urban
Learn about Python's concurrency tools, from built-in primitives to distributed computing frameworks like Dask and Ray, and discover best practices for scaling applications.
- Python provides powerful built-in concurrency primitives through modules like `concurrent.futures`, `threading`, `multiprocessing`, and `asyncio`
- The `concurrent.futures` module (introduced in Python 3.2) offers high-level abstractions for concurrent execution through `ThreadPoolExecutor` and `ProcessPoolExecutor`
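A minimal sketch of the `ThreadPoolExecutor` API; the task function and worker count here are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def task(n):
    # Placeholder for an I/O-bound operation such as an HTTP request
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns a Future; result() blocks until that task completes
    futures = [pool.submit(task, n) for n in range(5)]
    results = [f.result() for f in futures]

print(results)  # [0, 1, 4, 9, 16]
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` keeps the same interface but runs tasks in separate processes.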
- Concurrency means managing multiple tasks whose execution overlaps in time; parallelism specifically means tasks run at the same instant across multiple processing units
- Key limitations to consider:
  - Global Interpreter Lock (GIL) for threading
  - Memory constraints for process-based parallelism
  - Serialization challenges with `pickle`
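The `pickle` limitation can be shown directly: work sent to a `ProcessPoolExecutor` must be picklable, and objects such as lambdas are not. A small illustration:

```python
import pickle

square = lambda x: x * x  # lambdas cannot be looked up by name for unpickling

try:
    pickle.dumps(square)
    picklable = True
except (pickle.PicklingError, AttributeError):
    picklable = False

print(picklable)  # False: a process pool could not receive this task
```

Defining the function at module level with `def` avoids the problem, which is why worker functions for process-based parallelism are usually written that way.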
- For scaling beyond a single machine, frameworks like Dask and Ray provide:
  - Distributed computing capabilities
  - Data management across workers
  - Resource management and scheduling
  - Fault tolerance
  - Integration with async/await
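The async/await integration point can be sketched with the standard library alone: `asyncio` can off-load blocking work to a `concurrent.futures` pool via `run_in_executor` (the `blocking_io` function below is a stand-in):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def blocking_io(n):
    # Stands in for a blocking call (file read, legacy client library, ...)
    return n + 1

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor() as pool:
        # Off-load blocking work to the pool without stalling the event loop
        tasks = [loop.run_in_executor(pool, blocking_io, n) for n in range(3)]
        return await asyncio.gather(*tasks)

results = asyncio.run(main())
print(results)  # [1, 2, 3]
```

Dask and Ray expose similar awaitable handles for remote tasks, so the same coding style carries over to the distributed setting.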
- Best practices for concurrent/parallel processing:
  - Profile code before optimization
  - Process data in chunks when possible
  - Consider resource limitations (CPU, memory)
  - Use memory mapping for large datasets
  - Choose appropriate executor based on workload type (I/O vs CPU-bound)
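Chunked processing can be sketched as follows; the chunk size and workload are illustrative, and the point is that each task amortizes scheduling overhead over many items:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def chunked(iterable, size):
    # Yield successive fixed-size chunks from any iterable
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def process_chunk(chunk):
    # One task handles a whole chunk instead of a single item
    return sum(x * x for x in chunk)

data = range(1000)
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(process_chunk, chunked(data, 100)))

total = sum(partials)
print(total)  # 332833500, the sum of squares 0..999
```

For a CPU-bound `process_chunk`, the same structure works with `ProcessPoolExecutor`, where chunking also reduces pickling round-trips.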
- Common use cases for concurrency:
  - Web servers
  - API calls
  - Data processing
  - Machine learning workloads
  - Grid search operations
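A grid search, for instance, maps naturally onto an executor because each parameter combination is evaluated independently. The objective function below is hypothetical, standing in for training and scoring a model:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def evaluate(params):
    # Hypothetical objective: in practice this would train/score a model
    lr, depth = params
    return (params, lr * depth)

grid = list(product([0.01, 0.1], [3, 5, 7]))  # 2 x 3 = 6 combinations

with ThreadPoolExecutor() as pool:
    scores = dict(pool.map(evaluate, grid))

best = max(scores, key=scores.get)
print(best)  # (0.1, 7)
```

Because each evaluation is independent, the same pattern scales out directly to Dask or Ray workers on a cluster.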
- Both Dask and Ray build on concepts similar to those in `concurrent.futures` but add capabilities for:
  - Distributed execution
  - Data serialization
  - Cluster management
  - Worker coordination