Talks - Michael Droettboom: Measuring the performance of CPython

Learn how Microsoft's CPython team measures Python performance using PyPerformance benchmarks, statistical techniques, and continuous testing to drive optimizations.

Key takeaways
  • The CPython Performance Engineering team at Microsoft uses the PyPerformance suite, which contains over 100 benchmarks, to measure Python performance

  • Benchmarks are categorized into three main types:

    • Application benchmarks (full applications like Django CMS)
    • Toy benchmarks (simple programs under 100 lines)
    • Microbenchmarks (testing specific language features; a minimal pyperf-style example appears after this list)
  • Key challenges in benchmarking include:

    • System noise from the OS and other processes
    • CPU thermal management and speed variations
    • Memory layout randomization
    • Additional noise from virtual machines
    • Benchmark warmup time (illustrated in a sketch after this list)
  • Performance improvements typically come from stacking many small ~1% optimizations rather than from single major breakthroughs (the compounding arithmetic is sketched after this list)

  • The team runs benchmarks on bare metal hardware to reduce noise, with typical noise levels around ±1% when properly controlled

  • Statistical techniques used:

    • Running benchmarks multiple times
    • Hierarchical Performance Testing (HPT)
    • Distribution analysis
    • Geometric mean for aggregating results (see the comparison sketch after this list)
  • Different benchmarks spend their time in different areas:

    • 54 benchmarks spend most of their time in the interpreter
    • Others split their time between library code, memory management, and the kernel
    • Understanding where each benchmark spends its time is important
  • Continuous benchmarking helps evaluate changes:

    • Tests proposed changes against the main branch
    • Takes ~1.5-2.5 hours per run
    • Security considerations limit public access
  • Future needs include:

    • More real-world application benchmarks
    • Better parallel/threading benchmarks
    • Reduced benchmark runtime while maintaining value
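
Example sketches

Microbenchmarks in the suite are typically written against pyperf, the measurement harness that pyperformance builds on. The sketch below is a minimal, hypothetical example (the benchmark name and workload are made up and not part of the real suite); pyperf itself takes care of warmup runs, spawning worker processes, and computing statistics.

```python
# Minimal microbenchmark sketch using pyperf (the harness pyperformance
# builds on). The benchmark name and workload are illustrative only.
import pyperf


def build_and_sort():
    # Exercise one narrow piece of the runtime: list building plus sorting.
    data = [(-i) % 997 for i in range(10_000)]
    return sorted(data)


runner = pyperf.Runner()
# pyperf runs warmups, spawns worker processes, and reports the mean and
# standard deviation across many measured values.
runner.bench_func("toy_build_and_sort", build_and_sort)
```

Running the full suite is normally done through the pyperformance command-line tool, which writes a JSON file of results that can later be compared against another run.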
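
One way to see warmup and run-to-run noise for yourself is to time the same workload repeatedly and look at the distribution: the first few repetitions are often slower while caches and interpreter state warm up, and the remaining values still jitter by a small percentage. This is a rough, self-contained sketch (the workload and repetition counts are arbitrary choices, not the team's methodology):

```python
# Rough sketch: time one workload repeatedly to observe warmup and noise.
import statistics
import time


def workload():
    return sum(i * i for i in range(50_000))


timings = []
for _ in range(30):
    start = time.perf_counter()
    workload()
    timings.append(time.perf_counter() - start)

# Treat the first few repetitions as warmup and summarize the rest.
warmup, steady = timings[:5], timings[5:]
mean = statistics.mean(steady)
stdev = statistics.stdev(steady)
print("warmup runs (ms):", [round(t * 1e3, 2) for t in warmup])
print(f"steady mean: {mean * 1e3:.2f} ms, noise: +/- {100 * stdev / mean:.1f}%")
```

Harnesses such as pyperf discard warmup values and spread measurements across worker processes precisely to keep this kind of noise from contaminating the reported numbers.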
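
The value of stacking small wins is easy to quantify: speedups multiply rather than add, so twenty independent 1% improvements compound to roughly a 22% overall speedup. A back-of-the-envelope check (the count of optimizations is a made-up example):

```python
# Back-of-the-envelope: independent speedups multiply rather than add.
individual = 1.01   # a single 1% improvement
count = 20          # hypothetical number of stacked optimizations
combined = individual ** count
print(f"{count} x 1% improvements -> {100 * (combined - 1):.1f}% faster overall")
# -> about 22% faster, versus 20% if the gains merely added up
```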
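
Because per-benchmark results are ratios, aggregating a whole suite into one number is usually done with the geometric mean of the speedups. The sketch below is a hedged illustration of comparing two result files, assuming baseline.json (from main) and change.json (from the branch under test) were produced by pyperformance or pyperf; the filenames are hypothetical, and this is not the team's exact tooling:

```python
# Sketch: compare two pyperf/pyperformance JSON result files and aggregate
# the per-benchmark speedups with a geometric mean.
import statistics

import pyperf


def load_means(path):
    suite = pyperf.BenchmarkSuite.load(path)
    return {bench.get_name(): bench.mean() for bench in suite.get_benchmarks()}


baseline = load_means("baseline.json")   # results from the main branch
change = load_means("change.json")       # results from the proposed change

# Speedup > 1.0 means the change made that benchmark faster.
speedups = {
    name: baseline[name] / change[name]
    for name in baseline
    if name in change
}
overall = statistics.geometric_mean(speedups.values())
print(f"benchmarks compared: {len(speedups)}")
print(f"overall speedup (geometric mean): {overall:.3f}x")
```

In practice the team layers more statistics on top of a simple mean, such as Hierarchical Performance Testing, to decide whether an observed difference is real given the ±1% noise floor; pyperf also ships a compare_to command for side-by-side comparisons of two result files.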