Braxton Cuneo - Injecting Python Functions into a Template-Driven CUDA C++ Framework | SciPy 2024


Learn how to inject Python functions into CUDA C++ code while abstracting GPU complexity. See real examples of bridging languages via templates and FFI for scientific computing.

Key takeaways

The framework injects Python functions into CUDA C++ code while hiding GPU complexity from nuclear scientists and other domain experts
Uses templates and a foreign function interface (FFI) to bridge Python and C++, with Harmonize serving as middleware between MC/DC (a Python framework) and CUDA
Follows an asynchronous programming model: calls are not executed immediately but are scheduled, and may run on different hardware
The system handles memory management, divergence reduction, and GPU-specific optimizations automatically so scientists can focus on physics and algorithms
Special handling required for data types and alignment issues when working between Python/Numba and CUDA:
- Cannot use nested records directly
- Must handle zero-size arrays carefully
- Need proper alignment for struct members
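The three constraints above can be illustrated with plain NumPy structured dtypes, which are what Numba records are built on. This is a minimal sketch, not the framework's actual code: the helper names (flatten_fields, make_aligned_dtype), the particle layout, and the choice to pad zero-length arrays to one element are all assumptions made for illustration.

```python
import numpy as np


def flatten_fields(spec, prefix=""):
    """Flatten a nested field spec into a flat list of (name, type, count).

    `spec` is a hypothetical layout description: a list of
    (name, type_or_nested_spec, count) tuples.
    """
    flat = []
    for name, kind, count in spec:
        if isinstance(kind, list):
            # Nested record: inline its members under a prefixed name.
            flat.extend(flatten_fields(kind, prefix + name + "_"))
        else:
            # Assumed workaround: pad zero-length arrays to one element
            # so the field still occupies storage.
            flat.append((prefix + name, kind, max(count, 1)))
    return flat


def make_aligned_dtype(spec):
    """Build a flat structured dtype with C-like member alignment."""
    fields = []
    for name, kind, count in flatten_fields(spec):
        if count == 1:
            fields.append((name, kind))
        else:
            fields.append((name, kind, (count,)))
    # align=True asks NumPy to insert padding the way a C compiler would.
    return np.dtype(fields, align=True)


# Hypothetical particle layout with a nested record and an empty array.
particle_spec = [
    ("alive", np.uint8, 1),
    ("position", [("x", np.float64, 1),
                  ("y", np.float64, 1),
                  ("z", np.float64, 1)], 1),
    ("tallies", np.float64, 0),
]

particle_dtype = make_aligned_dtype(particle_spec)
print(particle_dtype.names)
print(particle_dtype.itemsize)
```

With align=True, the 8-byte members start on 8-byte boundaries even though the first member is a single byte, which matches what a C++ struct with the same members would look like on the device side.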
Provides automatic management for:
- Device memory allocation
- Data movement between CPU/GPU
- Work scheduling
- Thread coordination
Performance optimizations include:
- Shared memory usage
- Thread divergence reduction
- Load balancing
- Locality optimization
Framework is generic and could potentially support:
- AMD GPUs (planned)
- Other language bindings
- Different backend runtimes
An open-source implementation is available, with automated tooling that handles the complex linking and compilation steps
Particularly useful for Monte Carlo simulations that require many parallel computations, such as neutron transport problems
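The asynchronous model described in the takeaways can be sketched in pure Python as a toy runtime that queues calls instead of running them, then executes them in batches of the same function. This is an illustrative assumption, not Harmonize's API: DeferredRuntime, register, async_call, and the move/collide events are hypothetical names, and the "fullest queue first" policy is just one simple way to keep each batch uniform, which is the idea behind reducing thread divergence.

```python
from collections import defaultdict, deque


class DeferredRuntime:
    """Toy stand-in for an asynchronous runtime: calls are queued, not run."""

    def __init__(self):
        self.functions = {}
        self.queues = defaultdict(deque)  # one pending-call queue per function

    def register(self, func):
        """Make a function schedulable under its own name."""
        self.functions[func.__name__] = func
        return func

    def async_call(self, name, *args):
        """Record a call for later; nothing executes here."""
        self.queues[name].append(args)

    def run(self, batch_size=4):
        """Drain all queues, running each batch from a single queue."""
        executed = []
        while any(self.queues.values()):
            # Pick the fullest queue so a batch contains only one function.
            name = max(self.queues, key=lambda n: len(self.queues[n]))
            queue = self.queues[name]
            for _ in range(min(batch_size, len(queue))):
                args = queue.popleft()
                executed.append(name)
                self.functions[name](self, *args)
        return executed


runtime = DeferredRuntime()
log = []


@runtime.register
def move(rt, particle_id, steps_left):
    log.append(("move", particle_id))
    if steps_left > 0:
        rt.async_call("collide", particle_id, steps_left)


@runtime.register
def collide(rt, particle_id, steps_left):
    log.append(("collide", particle_id))
    rt.async_call("move", particle_id, steps_left - 1)


# Seed three particles; each moves, collides once, then moves again.
for pid in range(3):
    runtime.async_call("move", pid, 1)

order = runtime.run(batch_size=4)
print(order)
```

Because every call goes through the queue, the runtime is free to decide when and where work runs. In this sketch that freedom is only used to group identical calls together, so all three particles move, then all collide, then all move again.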
