Building a CML Pipeline with Spark & Kafka • Cameron Joannidis • YOW! 2018

Learn how to build a scalable ML pipeline using Spark & Kafka, with best practices for feature engineering, model deployment, and performance optimization drawn from real production experience.

Key takeaways
  • Building a centralized feature store requires careful consideration of organizational structure, funding model, and actual usage patterns versus the usage you expect or desire

  • Naive centralization of ML features is complex and expensive - feature engineering should be declarative and independently definable to avoid tight coupling (see the declarative feature sketch after this list)

  • Performance optimizations focused on reducing I/O, sharing compute resources, and caching common aggregates. Moving from a linear chain of joins to parallel processing provided significant speedups (see the shared-aggregate sketch after this list)

  • Feature generation should happen on-demand rather than maintaining a massive central table. Generate only what’s needed when needed.

  • Two main approaches to model serving: data-to-model (sending data to the model) vs model-to-data (sending the model to where the data lives). Model-to-data is generally more efficient for large datasets (see the scoring sketch after this list)

  • Abstractions like feature transformers helped simplify feature engineering while keeping performance optimizations under the hood (see the transformer-composition sketch after this list)

  • Using declarative feature definitions allowed the same logic to work for both batch and streaming use cases (see the batch/streaming sketch after this list)

  • Testing and versioning of features are critical - declarative definitions make both much easier (see the feature unit-test sketch after this list)

  • Understanding real usage patterns is key - most projects only needed a fraction of available features and data

  • Modern Spark versions (2.x+) provide significantly better performance compared to 1.x for these types of workloads
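
The "declarative and independently definable" point maps naturally onto Spark Column expressions. Below is a minimal sketch of the idea, not the speaker's actual API; names such as FeatureDef, buildFeatures, and the input columns are illustrative assumptions:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// A feature is just a name plus a declarative aggregate expression.
// Teams can define features independently; nothing here says how,
// where, or when the feature is computed.
final case class FeatureDef(name: String, expr: Column)

object CustomerFeatures {
  // Assumed input columns: customer_id, txn_ts (timestamp), amount (double)
  val totalSpend30d = FeatureDef(
    "total_spend_30d",
    sum(when(col("txn_ts") >= date_sub(current_date(), 30), col("amount"))
      .otherwise(lit(0.0))))

  val txnCount = FeatureDef("txn_count", count(lit(1)))
}

// Any set of definitions collapses into a single groupBy/agg, so Spark
// optimizes all requested features together instead of materializing a
// giant central table and joining against it.
def buildFeatures(txns: DataFrame, keyCol: String, defs: Seq[FeatureDef]): DataFrame =
  txns.groupBy(col(keyCol))
    .agg(defs.head.expr.as(defs.head.name),
         defs.tail.map(d => d.expr.as(d.name)): _*)
```

Because features are generated on demand from these definitions, only the features a project actually asks for are ever computed, which also addresses the "generate only what's needed" takeaway.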
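The shared-aggregate sketch below illustrates the caching point: compute a common rollup once, cache it, and let several feature groups reuse it. Paths, column names, and the rollup itself are assumptions, not details from the talk:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("feature-rollups").getOrCreate()
import spark.implicits._

val txns = spark.read.parquet("/data/transactions")

// Many feature groups need per-customer, per-day rollups, so compute the
// rollup once and cache it rather than re-scanning the raw event table.
val dailyRollup = txns
  .groupBy($"customer_id", to_date($"txn_ts").as("day"))
  .agg(sum($"amount").as("daily_spend"), count(lit(1)).as("daily_txns"))
  .cache() // computed once, shared by every feature group below

val spendFeatures = dailyRollup.groupBy($"customer_id")
  .agg(sum($"daily_spend").as("total_spend"))

val activityFeatures = dailyRollup.groupBy($"customer_id")
  .agg(count(lit(1)).as("active_days"))

// Joining the small, pre-aggregated feature groups is far cheaper than
// joining each feature against the raw events one after another.
val customerFeatures = spendFeatures.join(activityFeatures, Seq("customer_id"))
```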
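The scoring sketch shows model-to-data in its simplest Spark form: the small model artifact moves to the cluster and scoring runs next to the data. Model and output paths are illustrative assumptions:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("batch-scoring").getOrCreate()

val model    = PipelineModel.load("/models/churn/v12")      // small artifact moves
val features = spark.read.parquet("/features/churn/latest") // large data stays put

// transform() runs in parallel on the executors that hold the data,
// so there is no per-record network round trip to a scoring service.
model.transform(features)
  .select("customer_id", "prediction", "probability")
  .write.mode("overwrite").parquet("/scores/churn/latest")
```

The data-to-model alternative would instead push each feature vector to a remote model endpoint, which tends to be dominated by network cost when the dataset is large.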
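The transformer-composition sketch below is one way to read the "feature transformer" abstraction, assumed rather than taken from the talk: each transformer is a plain DataFrame-to-DataFrame function, so transformers compose while Spark's optimizer still sees a single query plan:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// A transformer is just a function; composition stays trivial and the
// performance optimizations stay under the hood in the query planner.
type FeatureTransformer = DataFrame => DataFrame

val addSpendRatio: FeatureTransformer =
  df => df.withColumn("spend_ratio", col("total_spend") / col("txn_count"))

val addRecency: FeatureTransformer =
  df => df.withColumn("days_since_last_txn",
    datediff(current_date(), col("last_txn_date")))

// Users chain simple building blocks with Dataset.transform.
def enrich(features: DataFrame): DataFrame =
  features.transform(addSpendRatio).transform(addRecency)
```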
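The batch/streaming sketch reuses the FeatureDef/buildFeatures helpers from the first sketch, once against a historical store and once against a Kafka feed. The topic name, schema, broker address, and paths are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("features").getOrCreate()
import spark.implicits._

val txnSchema = StructType(Seq(
  StructField("customer_id", StringType),
  StructField("txn_ts", TimestampType),
  StructField("amount", DoubleType)))

val featureDefs = Seq(CustomerFeatures.totalSpend30d, CustomerFeatures.txnCount)

// Batch: backfill features from the historical store.
val batchFeatures =
  buildFeatures(spark.read.parquet("/data/transactions"), "customer_id", featureDefs)

// Streaming: the same declarative definitions applied to the live Kafka feed.
val streamingTxns = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "transactions")
  .load()
  .select(from_json($"value".cast("string"), txnSchema).as("t"))
  .select("t.*")

val streamingQuery =
  buildFeatures(streamingTxns, "customer_id", featureDefs)
    .writeStream
    .outputMode("update")   // streaming aggregations need update/complete mode
    .format("console")
    .start()
```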
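Finally, the feature unit-test sketch shows why declarative definitions are easy to test: a feature is just a named expression, so it can be exercised against a tiny in-memory DataFrame with no cluster and no feature store involved. It assumes ScalaTest and the helpers from the first sketch:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class TxnCountFeatureSpec extends AnyFunSuite {

  // A local SparkSession is enough to evaluate feature expressions in tests.
  private val spark = SparkSession.builder
    .master("local[1]").appName("feature-tests").getOrCreate()
  import spark.implicits._

  test("txn_count counts one row per transaction") {
    val txns = Seq(
      ("c1", "2018-11-30", 10.0),
      ("c1", "2018-12-01", 99.5)
    ).toDF("customer_id", "txn_ts", "amount")

    val result = buildFeatures(txns, "customer_id",
      Seq(CustomerFeatures.txnCount))

    assert(result.select("txn_count").as[Long].head() == 2L)
  }
}
```

Versioning follows the same logic: a feature definition is a small piece of code, so it can be reviewed, versioned, and diffed like any other code rather than tracked as rows in a shared table.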