Building a CML Pipeline with Spark & Kafka • Cameron Joannidis • YOW! 2018

Learn how to build a scalable ML pipeline using Spark & Kafka, with best practices for feature engineering, model deployment, and performance optimization drawn from real production experience.

Key takeaways
  • Building a centralized feature store requires careful consideration of organizational structure, funding model, and actual usage patterns versus the usage you expect or desire

  • Naive centralization of ML features is complex and expensive - feature engineering should be declarative and independently definable to avoid tight coupling (see the declarative feature sketch after this list)

  • Performance optimizations focused on reducing I/O, sharing compute resources, and caching common aggregates. Moving from a linear chain of joins to parallel processing provided significant speedups (see the shared-aggregate sketch after this list)

  • Feature generation should happen on-demand rather than maintaining a massive central table. Generate only what’s needed when needed.

  • Two main approaches to model serving: data-to-model (sending data to the model) vs model-to-data (sending the model to where the data lives). Model-to-data is generally more efficient for large datasets (see the scoring sketch after this list)

  • Abstractions like feature transformers helped simplify feature engineering while keeping performance optimizations under the hood (see the transformer-composition sketch after this list)

  • Using declarative feature definitions allowed the same logic to work for both batch and streaming use cases (see the batch/streaming sketch after this list)

  • Testing and versioning of features are critical - declarative definitions make both much easier (see the feature unit-test sketch after this list)

  • Understanding real usage patterns is key - most projects only needed a fraction of available features and data

  • Modern Spark versions (2.x+) provide significantly better performance compared to 1.x for these types of workloads
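
The "declarative and independently definable" point maps naturally onto Spark Column expressions. Below is a minimal sketch of the idea, not the speaker's actual API; names such as FeatureDef, buildFeatures, and the input columns are illustrative assumptions:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// A feature is just a name plus a declarative aggregate expression.
// Teams can define features independently; nothing here says how,
// where, or when the feature is computed.
final case class FeatureDef(name: String, expr: Column)

object CustomerFeatures {
  // Assumed input columns: customer_id, txn_ts (timestamp), amount (double)
  val totalSpend30d = FeatureDef(
    "total_spend_30d",
    sum(when(col("txn_ts") >= date_sub(current_date(), 30), col("amount"))
      .otherwise(lit(0.0))))

  val txnCount = FeatureDef("txn_count", count(lit(1)))
}

// Any set of definitions collapses into a single groupBy/agg, so Spark
// optimizes all requested features together instead of materializing a
// giant central table and joining against it.
def buildFeatures(txns: DataFrame, keyCol: String, defs: Seq[FeatureDef]): DataFrame =
  txns.groupBy(col(keyCol))
    .agg(defs.head.expr.as(defs.head.name),
         defs.tail.map(d => d.expr.as(d.name)): _*)
```

Because features are generated on demand from these definitions, only the features a project actually asks for are ever computed, which also addresses the "generate only what's needed" takeaway.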
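The shared-aggregate sketch below illustrates the caching point: compute a common rollup once, cache it, and let several feature groups reuse it. Paths, column names, and the rollup itself are assumptions, not details from the talk:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("feature-rollups").getOrCreate()
import spark.implicits._

val txns = spark.read.parquet("/data/transactions")

// Many feature groups need per-customer, per-day rollups, so compute the
// rollup once and cache it rather than re-scanning the raw event table.
val dailyRollup = txns
  .groupBy($"customer_id", to_date($"txn_ts").as("day"))
  .agg(sum($"amount").as("daily_spend"), count(lit(1)).as("daily_txns"))
  .cache() // computed once, shared by every feature group below

val spendFeatures = dailyRollup.groupBy($"customer_id")
  .agg(sum($"daily_spend").as("total_spend"))

val activityFeatures = dailyRollup.groupBy($"customer_id")
  .agg(count(lit(1)).as("active_days"))

// Joining the small, pre-aggregated feature groups is far cheaper than
// joining each feature against the raw events one after another.
val customerFeatures = spendFeatures.join(activityFeatures, Seq("customer_id"))
```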
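The scoring sketch shows model-to-data in its simplest Spark form: the small model artifact moves to the cluster and scoring runs next to the data. Model and output paths are illustrative assumptions:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("batch-scoring").getOrCreate()

val model    = PipelineModel.load("/models/churn/v12")      // small artifact moves
val features = spark.read.parquet("/features/churn/latest") // large data stays put

// transform() runs in parallel on the executors that hold the data,
// so there is no per-record network round trip to a scoring service.
model.transform(features)
  .select("customer_id", "prediction", "probability")
  .write.mode("overwrite").parquet("/scores/churn/latest")
```

The data-to-model alternative would instead push each feature vector to a remote model endpoint, which tends to be dominated by network cost when the dataset is large.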
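The transformer-composition sketch below is one way to read the "feature transformer" abstraction, assumed rather than taken from the talk: each transformer is a plain DataFrame-to-DataFrame function, so transformers compose while Spark's optimizer still sees a single query plan:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// A transformer is just a function; composition stays trivial and the
// performance optimizations stay under the hood in the query planner.
type FeatureTransformer = DataFrame => DataFrame

val addSpendRatio: FeatureTransformer =
  df => df.withColumn("spend_ratio", col("total_spend") / col("txn_count"))

val addRecency: FeatureTransformer =
  df => df.withColumn("days_since_last_txn",
    datediff(current_date(), col("last_txn_date")))

// Users chain simple building blocks with Dataset.transform.
def enrich(features: DataFrame): DataFrame =
  features.transform(addSpendRatio).transform(addRecency)
```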
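The batch/streaming sketch reuses the FeatureDef/buildFeatures helpers from the first sketch, once against a historical store and once against a Kafka feed. The topic name, schema, broker address, and paths are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("features").getOrCreate()
import spark.implicits._

val txnSchema = StructType(Seq(
  StructField("customer_id", StringType),
  StructField("txn_ts", TimestampType),
  StructField("amount", DoubleType)))

val featureDefs = Seq(CustomerFeatures.totalSpend30d, CustomerFeatures.txnCount)

// Batch: backfill features from the historical store.
val batchFeatures =
  buildFeatures(spark.read.parquet("/data/transactions"), "customer_id", featureDefs)

// Streaming: the same declarative definitions applied to the live Kafka feed.
val streamingTxns = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "transactions")
  .load()
  .select(from_json($"value".cast("string"), txnSchema).as("t"))
  .select("t.*")

val streamingQuery =
  buildFeatures(streamingTxns, "customer_id", featureDefs)
    .writeStream
    .outputMode("update")   // streaming aggregations need update/complete mode
    .format("console")
    .start()
```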
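Finally, the feature unit-test sketch shows why declarative definitions are easy to test: a feature is just a named expression, so it can be exercised against a tiny in-memory DataFrame with no cluster and no feature store involved. It assumes ScalaTest and the helpers from the first sketch:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class TxnCountFeatureSpec extends AnyFunSuite {

  // A local SparkSession is enough to evaluate feature expressions in tests.
  private val spark = SparkSession.builder
    .master("local[1]").appName("feature-tests").getOrCreate()
  import spark.implicits._

  test("txn_count counts one row per transaction") {
    val txns = Seq(
      ("c1", "2018-11-30", 10.0),
      ("c1", "2018-12-01", 99.5)
    ).toDF("customer_id", "txn_ts", "amount")

    val result = buildFeatures(txns, "customer_id",
      Seq(CustomerFeatures.txnCount))

    assert(result.select("txn_count").as[Long].head() == 2L)
  }
}
```

Versioning follows the same logic: a feature definition is a small piece of code, so it can be reviewed, versioned, and diffed like any other code rather than tracked as rows in a shared table.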