Talks - Kevin Kho, Han Wang: Speed is Not All You Need for Data Processing

Explore why speed isn't everything in data processing. Learn about tool selection, development experience, and the true costs of data projects beyond raw performance.

Key takeaways
  • Speed benchmarks often lack important context and don’t tell the full story of tool selection - factors like developer experience, testability, and scalability often matter more than raw speed

  • ~90% of data projects are “small data” workloads that single-machine tools like Pandas, Polars, or DuckDB can handle, while only ~10% truly need distributed computing solutions like Spark

  • The total cost of data projects includes both infrastructure and developer costs - choosing tools based purely on performance can lead to higher maintenance and developer overhead

  • Abstraction layers (like Fugue) can provide the best of both worlds - allowing development and testing in a local environment while enabling seamless scaling to distributed computing when needed (see the first sketch after this list)

  • Different tools (Pandas, Spark, DuckDB, etc.) are designed for different personas and use cases - they should be viewed as complementary rather than mutually exclusive

  • Benchmark results can be manipulated by shifting the benchmark setup, excluding I/O time, or selecting favorable scenarios - organizations should test tools in their own specific context

  • Code testability is often overlooked but critical - code written against distributed compute engines tends to be harder to test than code developed locally

  • SQL and Python each have their strengths - combining them through an abstraction layer lets teams use the benefits of both rather than choosing one exclusively (see the second sketch after this list)

  • Infrastructure costs need to be weighed against development speed and maintainability - faster execution may not be worth the added complexity

  • When evaluating tools, consider the full ecosystem, developer productivity, and ability to scale rather than focusing solely on performance metrics
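First sketch: to make the abstraction-layer takeaway concrete, here is a minimal example using Fugue's `transform` function, which lets the same Pandas-style function run locally during development and on a distributed engine when needed. The dataframe, column names, and `fill_missing` logic are invented for illustration, and the Spark run assumes a working PySpark installation.

```python
import pandas as pd
from fugue import transform

# Business logic stays a plain, locally testable Pandas function
def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    return df.fillna({"amount": 0})

df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, None, 5.0]})

# Develop and unit-test locally: with no engine, this runs on native Pandas
local_result = transform(df, fill_missing, schema="*")

# Scale out by swapping the execution engine; the function itself is unchanged
# (requires pyspark and an available Spark environment)
spark_result = transform(df, fill_missing, schema="*", engine="spark")
```

Because `fill_missing` is an ordinary Pandas function, it can be unit-tested without spinning up a cluster, which is exactly the testability point above.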
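Second sketch: for the SQL-plus-Python takeaway, the rough example below shows FugueSQL mixing a SQL query with a Python transformer in one workflow. The exact FugueSQL syntax is written from memory and may need adjusting for your Fugue version; the table, columns, and `add_tax` function are made up for illustration.

```python
import pandas as pd
from fugue_sql import fsql

df = pd.DataFrame({"region": ["east", "west", "east"], "sales": [10, 20, 30]})

# schema: *, sales_with_tax:double
def add_tax(df: pd.DataFrame) -> pd.DataFrame:
    # Python handles the logic that is awkward to express in SQL
    return df.assign(sales_with_tax=df["sales"] * 1.1)

query = """
SELECT region, SUM(sales) AS sales FROM df GROUP BY region
TRANSFORM USING add_tax
PRINT
"""

fsql(query, df=df).run()           # local execution for development
# fsql(query, df=df).run("spark")  # same query, distributed execution
```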