Nathan Goldbaum - From no CPython C API experience to shipping a new DType in NumPy 2.0


Learn how NumPy maintainer Nathan Goldbaum developed the new UTF-8 string dtype for NumPy 2.0, which is 5-500x faster than object arrays of strings, and how community collaboration and development best practices helped him ship it.

Key takeaways

NumPy 2.0 introduced a new variable-width string dtype with UTF-8 encoding, improving upon previous fixed-width Unicode strings that used 4 bytes per character
The new string dtype implementation uses an arena allocator and supports short string optimization, making it 5-500x faster than previous object array implementations
Community collaboration was key: the work was funded through a NASA ROSES grant involving multiple scientific Python projects (NumPy, pandas, SciPy, scikit-learn)
Getting started as a contributor/maintainer:
- Review code even if you’re not a maintainer
- Fix relevant bugs that block other people
- For every PR you submit, review another one
- Start with smaller tasks to get familiar with the codebase
Development best practices:
- Build prototypes before main implementation
- Write thorough tests for new features
- Use debuggers (GDB, LLDB) to understand code
- Work in public but separate repos for experiments
Community engagement tips:
- Use project communication channels (Slack, mailing lists)
- Attend face-to-face meetings when available
- Don’t be afraid to ask for help
- Acknowledge that big projects are hard and it’s normal to struggle
The new string dtype was implemented to solve ecosystem-wide issues, particularly benefiting pandas, which previously relied on object arrays for strings
The NumPy Enhancement Proposal (NEP) process requires working prototypes before acceptance of major features
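
To make the difference between the fixed-width and variable-width string dtypes concrete, here is a minimal sketch. It assumes NumPy 2.0 or later for the `StringDType` part and skips that part on older versions; the sample strings and variable names are illustrative and do not come from the talk.

```python
import numpy as np

words = ["a", "numpy", "a much longer string than the others"]
longest = max(len(w) for w in words)

# Fixed-width Unicode dtype: every element reserves room for the longest
# string in the array, at 4 bytes per character.
fixed = np.array(words)
fixed_itemsize = fixed.dtype.itemsize  # equals 4 * longest

# Assigning a string longer than the fixed width silently truncates it.
short = np.array(["ab", "cd"])
short[0] = "abcdef"
truncated = str(short[0])  # only the first two characters survive

# Variable-width UTF-8 dtype, available in NumPy 2.0+ (looked up defensively
# so the sketch still runs on older NumPy versions).
StringDType = getattr(getattr(np, "dtypes", None), "StringDType", None)
if StringDType is not None:
    variable = np.array(words, dtype=StringDType())
    variable[0] = "no truncation when a longer string is assigned"
    first = variable[0]
    upper = np.strings.upper(variable)  # vectorized string operation
```

With the fixed-width dtype, the single long entry determines the storage cost of every element, while the variable-width dtype stores each string at its own length and accepts longer values on assignment.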
