Nathan Goldbaum - From no CPython C API experience to shipping a new DType in NumPy 2.0

Learn how NumPy maintainer Nathan Goldbaum developed the new UTF-8 string dtype for NumPy 2.0, improving performance 5-500x through community collaboration and best practices.

Key takeaways
  • NumPy 2.0 introduced a new variable-width string dtype with UTF-8 encoding, improving upon previous fixed-width Unicode strings that used 4 bytes per character

  • The new string dtype implementation uses an arena allocator and supports short string optimization, making it 5-500x faster than previous object array implementations

  • Community collaboration was key - the work was funded through a NASA ROSES grant involving multiple scientific Python projects (NumPy, Pandas, SciPy, scikit-learn)

  • Getting started as a contributor/maintainer:

    • Review code even if you’re not a maintainer
    • Fix relevant bugs that block other people
    • For every PR you submit, review another one
    • Start with smaller tasks to get familiar with the codebase
  • Development best practices:

    • Build prototypes before main implementation
    • Write thorough tests for new features
    • Use debuggers (GDB, LDB) to understand code
    • Work in public but separate repos for experiments
  • Community engagement tips:

    • Use project communication channels (Slack, mailing lists)
    • Attend face-to-face meetings when available
    • Don’t be afraid to ask for help
    • Acknowledge that big projects are hard and it’s normal to struggle
  • The new string dtype was implemented to solve ecosystem-wide issues, particularly benefiting Pandas which previously relied on object arrays for strings

  • The NumPy Enhancement Proposal (NEP) process requires working prototypes before acceptance of major features