Talks - Kevin Kho, Han Wang: Speed is Not All You Need for Data Processing

Explore why speed isn't everything in data processing. Learn about tool selection, development experience, and the true costs of data projects beyond raw performance.

Key takeaways
  • Speed benchmarks often lack important context and don’t tell the full story of tool selection - factors like developer experience, testability, and scalability often matter more than raw speed

  • ~90% of data projects are “small data” workloads that single-machine tools like Pandas, Polars, or DuckDB can handle, while only ~10% truly need distributed computing solutions like Spark

  • The total cost of data projects includes both infrastructure and developer costs - choosing tools based purely on performance can lead to higher maintenance and developer overhead

  • Abstraction layers (like Fugue) can provide the best of both worlds - allowing development and testing in a local environment while enabling seamless scaling to distributed computing when needed (see the first sketch after this list)

  • Different tools (Pandas, Spark, DuckDB, etc.) are designed for different personas and use cases - they should be viewed as complementary rather than mutually exclusive

  • Benchmark results can be manipulated by shifting the benchmark setup, excluding I/O time, or selecting favorable scenarios - organizations should test tools in their own specific context

  • Code testability is often overlooked but critical - code written against distributed compute engines tends to be harder to test than code developed locally

  • SQL and Python each have their strengths - combining them through an abstraction layer lets teams use the benefits of both rather than choosing one exclusively (see the second sketch after this list)

  • Infrastructure costs need to be weighed against development speed and maintainability - faster execution may not be worth the added complexity

  • When evaluating tools, consider the full ecosystem, developer productivity, and ability to scale rather than focusing solely on performance metrics
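First sketch: to make the abstraction-layer takeaway concrete, here is a minimal example using Fugue's `transform` function, which lets the same Pandas-style function run locally during development and on a distributed engine when needed. The dataframe, column names, and `fill_missing` logic are invented for illustration, and the Spark run assumes a working PySpark installation.

```python
import pandas as pd
from fugue import transform

# Business logic stays a plain, locally testable Pandas function
def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    return df.fillna({"amount": 0})

df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, None, 5.0]})

# Develop and unit-test locally: with no engine, this runs on native Pandas
local_result = transform(df, fill_missing, schema="*")

# Scale out by swapping the execution engine; the function itself is unchanged
# (requires pyspark and an available Spark environment)
spark_result = transform(df, fill_missing, schema="*", engine="spark")
```

Because `fill_missing` is an ordinary Pandas function, it can be unit-tested without spinning up a cluster, which is exactly the testability point above.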
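Second sketch: for the SQL-plus-Python takeaway, the rough example below shows FugueSQL mixing a SQL query with a Python transformer in one workflow. The exact FugueSQL syntax is written from memory and may need adjusting for your Fugue version; the table, columns, and `add_tax` function are made up for illustration.

```python
import pandas as pd
from fugue_sql import fsql

df = pd.DataFrame({"region": ["east", "west", "east"], "sales": [10, 20, 30]})

# schema: *, sales_with_tax:double
def add_tax(df: pd.DataFrame) -> pd.DataFrame:
    # Python handles the logic that is awkward to express in SQL
    return df.assign(sales_with_tax=df["sales"] * 1.1)

query = """
SELECT region, SUM(sales) AS sales FROM df GROUP BY region
TRANSFORM USING add_tax
PRINT
"""

fsql(query, df=df).run()           # local execution for development
# fsql(query, df=df).run("spark")  # same query, distributed execution
```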