Braxton Cuneo - Injecting Python Functions into a Template-Driven CUDA C++ Framework | SciPy 2024

Learn how to inject Python functions into CUDA C++ code while abstracting GPU complexity. See real examples of bridging languages via templates and FFI for scientific computing.

Key takeaways
  • Framework allows injecting Python functions into CUDA C++ code while hiding GPU complexity from nuclear scientists and other domain experts

  • Uses templates and an FFI (Foreign Function Interface) to bridge Python and C++, with Harmonize serving as middleware between MC/DC (the Python framework) and CUDA

  • Asynchronous programming model: calls are not executed immediately but are scheduled for later, and may run on different hardware

  • System handles memory management, divergence reduction, and GPU-specific optimizations automatically so scientists can focus on physics/algorithms

  • Data types and alignment need special handling when passing data between Python/Numba and CUDA:
    • Nested records cannot be used directly
    • Zero-size arrays must be handled carefully
    • Struct members need proper alignment

  • Provides automatic management for:
    • Device memory allocation
    • Data movement between CPU and GPU
    • Work scheduling
    • Thread coordination

  • Performance optimizations include:
    • Shared memory usage
    • Thread divergence reduction
    • Load balancing
    • Locality optimization

  • Framework is generic and could support:
    • AMD GPUs (planned)
    • Other language bindings
    • Different backend runtimes

  • Open-source implementation is available, with automated tooling that handles the complex linking and compilation steps

  • Particularly useful for Monte Carlo simulations that require many parallel computations, such as neutron transport problems
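
The asynchronous model in the takeaways above (calls are recorded first and executed later) can be sketched in plain Python. This is only a toy illustration and not Harmonize's API: the names DeferredScheduler, async_call, run, and step are hypothetical, and the "scheduler" here is a simple FIFO queue on the CPU rather than anything that dispatches work to a GPU.

```python
from collections import deque


class DeferredScheduler:
    """Toy scheduler (hypothetical): async_call() records work, run() executes it."""

    def __init__(self):
        self._queue = deque()  # pending (function, args) pairs
        self.log = []          # order in which calls actually ran

    def async_call(self, fn, *args):
        # Nothing runs here; the call is only recorded for later.
        self._queue.append((fn, args))

    def run(self):
        # Drain the queue in FIFO order. Calls scheduled while draining
        # are appended to the same queue and picked up as well.
        while self._queue:
            fn, args = self._queue.popleft()
            fn(self, *args)


def step(sched, particle_id, remaining):
    """Hypothetical unit of work: log itself, then schedule a follow-up call."""
    sched.log.append((particle_id, remaining))
    if remaining > 0:
        sched.async_call(step, particle_id, remaining - 1)


sched = DeferredScheduler()
sched.async_call(step, 0, 2)
sched.async_call(step, 1, 1)
assert sched.log == []  # nothing has executed yet
sched.run()
print(sched.log)
```

Because the caller never depends on when or where a call runs, a real runtime is free to reorder or regroup the pending calls, which is what makes optimizations such as divergence reduction and load balancing possible behind the scenes.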
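
The struct-alignment takeaway above can be illustrated with Python's standard ctypes module, which lays out structures using the platform's C rules. This is a sketch under stated assumptions: the Particle record and its fields are hypothetical, and ctypes stands in for the Numba record types and C++ structs that a real Python-to-CUDA bridge would have to keep in agreement.

```python
import ctypes


class Particle(ctypes.Structure):
    # Hypothetical record; ctypes applies the platform's C layout rules,
    # inserting padding so each member starts at a suitably aligned offset.
    _fields_ = [
        ("alive", ctypes.c_uint8),
        ("energy", ctypes.c_double),
        ("cell", ctypes.c_int32),
    ]


def layout(struct_type):
    """Return (name, offset, size) for each field of a ctypes structure."""
    return [
        (name, getattr(struct_type, name).offset, getattr(struct_type, name).size)
        for name, _ctype in struct_type._fields_
    ]


fields = layout(Particle)
packed_size = sum(size for _name, _offset, size in fields)  # field data only
padded_size = ctypes.sizeof(Particle)                       # including padding

for name, offset, size in fields:
    print(f"{name:>6}: offset {offset:2d}, size {size}")
print("bytes of field data:    ", packed_size)
print("bytes with C alignment: ", padded_size)
```

The field data adds up to 13 bytes, but the aligned structure is usually larger because of padding. If one side of a language boundary assumes the packed layout and the other assumes the aligned one, members are read from the wrong offsets, which is why the layouts have to be made to match explicitly.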