Talks - Saksham Sharma: A low latency deepdive of Python with Cython
Dive deep into Python performance optimization using Cython. Learn how to achieve near-C speeds by bypassing interpreter overhead through static typing and GIL release techniques.
- Python’s interpreter (CPython) adds significant overhead by checking types and managing objects on every operation, so a simple operation such as addition takes ~100 ns, versus 1-10 ns in low-level code.
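To get a feel for the per-operation cost mentioned above, a minimal sketch using the standard `timeit` module is shown here. The helper name `ns_per_op` is hypothetical, and the numbers it prints depend on the machine and Python version, so they will not necessarily match the figures quoted in the talk.

```python
import timeit


def ns_per_op(stmt, setup="pass", number=200_000, repeat=5):
    """Return the best observed time per execution of stmt, in nanoseconds."""
    # timeit.repeat returns the total seconds for `number` executions, once per repeat.
    # Taking the minimum keeps the run least disturbed by other activity on the machine.
    best = min(timeit.repeat(stmt, setup=setup, number=number, repeat=repeat))
    return best / number * 1e9


# Operands are bound in setup so the compiler cannot constant-fold `a + b`.
add_ns = ns_per_op("a + b", setup="a = 1234; b = 5678")
# An empty statement estimates the cost of the timing loop itself.
loop_ns = ns_per_op("pass")

print(f"a + b: {add_ns:.1f} ns per iteration (timing loop alone: {loop_ns:.1f} ns)")
```

The measured figure includes the timing loop, which is why the empty-statement baseline is reported alongside it.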
- Cython can help optimize performance by:
  - Allowing static typing of variables and functions
  - Translating Python code to C, which is then compiled into a native extension module
  - Bypassing the Python interpreter for core operations
  - Enabling direct memory access through typed memoryviews
  - Providing options to release the Global Interpreter Lock (GIL)
- Raw untyped Cython code won’t improve performance much; explicit type declarations are needed to get significant speedups.
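What an explicit type declaration looks like can be sketched with Cython's pure Python mode, where types are written as ordinary annotations. The function name `sum_of_squares` is hypothetical and not from the talk, and the `except ImportError` stand-in is an assumption added only so the file also runs as plain Python when Cython is not installed. Uncompiled, this runs at normal interpreter speed; any speedup appears only after compiling the module with Cython.

```python
try:
    import cython  # when compiled by Cython, the annotations below become C types
except ImportError:
    # Assumption for this sketch: a stand-in so the code still runs without Cython.
    from types import SimpleNamespace
    cython = SimpleNamespace(int=int, longlong=int, compiled=False)


def sum_of_squares(n: cython.int) -> cython.longlong:
    # Declaring the loop variable and accumulator lets compiled Cython keep them
    # as C integers, so the loop body needs no Python object handling.
    i: cython.longlong
    total: cython.longlong = 0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(10))                      # 285
print("compiled by Cython:", cython.compiled)  # False when run as plain Python
```

Without the `i` and `total` declarations the same file still compiles, but the loop keeps operating on Python objects, which is why untyped Cython gains little.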
- For numerical operations, properly typed Cython code can achieve near-C performance:
  - Basic integer operations: ~3-5 ns
  - Function calls: ~5-10 ns
  - Array access: ~180 ns with typed memoryviews
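Cython's typed memoryviews accept any object that exports the buffer protocol. The standard library's own `memoryview` uses the same protocol, so it can illustrate the "shared memory, no copy" idea without compiling anything. This is a plain-Python analogy rather than Cython code: indexing here still goes through the interpreter, whereas a typed memoryview in compiled Cython is indexed in C.

```python
from array import array

samples = array("d", [1.0, 2.0, 3.0, 4.0])  # contiguous block of C doubles
view = memoryview(samples)                   # shares the same buffer, no copy

view[0] = 10.0      # writes through to the underlying array
evens = view[::2]   # strided slice over the same memory, still no copy

print(samples[0])                  # 10.0
print(evens.tolist())              # [10.0, 3.0]
print(view.format, view.itemsize)  # d 8
```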
- Real-world considerations:
  - I/O operations (disk, network) dwarf interpreter overhead
  - Cython is best for optimizing compute-intensive inner loops
  - Keep Pythonic APIs for users while optimizing core logic
  - Balance between flexibility and performance is key
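One way to picture the split between a Pythonic API and an optimized core is a thin public function that validates and normalizes its inputs, then hands fixed-type buffers to a small kernel. The names `dot` and `_dot_kernel` are hypothetical, and the kernel is left in plain Python here; it only marks the part one would consider moving to typed Cython.

```python
from array import array


def _dot_kernel(xs, ys, n):
    # Inner loop: the compute-heavy part that is a candidate for typed Cython.
    total = 0.0
    for i in range(n):
        total += xs[i] * ys[i]
    return total


def dot(a, b):
    """Public API: accepts any iterables of numbers."""
    # Normalize flexible inputs into contiguous buffers of C doubles once,
    # so the kernel can assume a single fixed element type.
    xs = array("d", a)
    ys = array("d", b)
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    return _dot_kernel(xs, ys, len(xs))


print(dot([1, 2, 3], (4, 5, 6)))  # 32.0
```

Callers keep passing lists, tuples, or generators, while the kernel signature stays narrow enough to type later.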
- Performance profiling tools available:
  - The dis module shows Python bytecode
  - Intel PMU (Performance Monitoring Unit) counters provide CPU instruction counts
  - Microbenchmarking helps establish baselines
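As a small illustration of the `dis` module mentioned above, the following sketch disassembles a one-line function. The function `add` is a hypothetical example, and the exact opcode names differ between CPython versions (for instance `BINARY_ADD` in older releases and `BINARY_OP` in newer ones), so no fixed listing is shown.

```python
import dis


def add(a, b):
    return a + b


# Human-readable listing of the bytecode the interpreter executes for add().
dis.dis(add)

# The same instructions as data, convenient for counting or filtering.
opnames = [ins.opname for ins in dis.get_instructions(add)]
print(opnames)
```

Even this single `+` expands into several bytecode instructions, each dispatched by the interpreter loop.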
- Cython can achieve performance comparable to Rust/C++ when properly optimized, though it requires more explicit type declarations and careful tuning.
- Best practices:
  - Cache computations that don’t change
  - Use typed memoryviews for array operations
  - Explicitly declare types for performance-critical code
  - Keep the Python interface clean while optimizing internals
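The first practice, caching computations that do not change, can be as simple as hoisting a loop-invariant value out of a hot loop. The sketch below uses hypothetical function names and an arbitrary normalization constant chosen only for illustration.

```python
import math


def scale_naive(values, sigma):
    # Recomputes the same normalization constant for every element.
    return [v / (sigma * math.sqrt(2.0 * math.pi)) for v in values]


def scale_cached(values, sigma):
    # The constant does not depend on v, so compute it once before the loop.
    inv_norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [v * inv_norm for v in values]


data = [0.5, 1.0, 2.0]
naive = scale_naive(data, 1.5)
cached = scale_cached(data, 1.5)
# Results agree up to floating-point rounding.
print(all(math.isclose(x, y) for x, y in zip(naive, cached)))  # True
```

Multiplying by a precomputed reciprocal can differ from division in the last bits, which is why the comparison uses `math.isclose` rather than exact equality.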