Building a CML Pipeline with Spark & Kafka • Cameron Joannidis • YOW! 2018
Learn how to build a scalable ML pipeline using Spark & Kafka, with best practices for feature engineering, model deployment, and performance optimization drawn from real production experience.
- Building a centralized feature store requires careful consideration of organizational structure, funding model, and real usage patterns versus desired usage.
- Naive centralization of ML features is complex and expensive; feature engineering should be declarative and independently definable to avoid tight coupling.
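As a rough illustration of what "declarative and independently definable" could look like in Spark, here is a minimal Scala sketch; the `Feature` case class and the `transactions` / `total_spend_30d` names are invented for the example, not taken from the talk.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// A feature is declared as data: a name, the source it reads from, and the
// column expression that computes it. Nothing executes at definition time.
case class Feature(name: String, source: String, expr: Column)

// Each feature is defined independently, so a team can add or change one
// without touching a shared, monolithic pipeline.
val totalSpend30d = Feature(
  name   = "total_spend_30d",
  source = "transactions",
  expr   = sum(when(col("event_date") >= date_sub(current_date(), 30), col("amount")).otherwise(lit(0.0)))
)

val txnCount30d = Feature(
  name   = "txn_count_30d",
  source = "transactions",
  expr   = count(when(col("event_date") >= date_sub(current_date(), 30), lit(1)))
)

// A single generic interpreter turns any set of declarations over one source
// into one grouped aggregation, keeping definitions decoupled from execution.
def buildFeatures(df: DataFrame, key: String, features: Seq[Feature]): DataFrame =
  df.groupBy(col(key))
    .agg(features.head.expr.as(features.head.name),
         features.tail.map(f => f.expr.as(f.name)): _*)
```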
- Performance optimizations focused on reducing I/O, sharing compute resources, and caching common aggregates; moving from a linear chain of joins to parallel processing provided significant speedups.
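A hedged sketch of the caching idea: scan the expensive source once, cache the shared aggregate, and derive several features from it instead of re-reading the source per feature. Table and column names (`transactions`, `customer_id`, and so on) are placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession.builder().getOrCreate()
import spark.implicits._

// One scan of the expensive source; the shared aggregate is cached so the
// feature computations below reuse it instead of triggering more I/O.
val baseAgg: DataFrame = spark.table("transactions")
  .groupBy($"customer_id")
  .agg(
    sum($"amount").as("total_spend"),
    count(lit(1)).as("txn_count"),
    max($"event_date").as("last_txn_date")
  )
  .cache()

// Derived features all read the cached aggregate; Spark plans them as one DAG
// rather than a linear chain of scans and wide joins.
val avgValue = baseAgg.select($"customer_id", ($"total_spend" / $"txn_count").as("avg_txn_value"))
val recency  = baseAgg.select($"customer_id", datediff(current_date(), $"last_txn_date").as("days_since_last_txn"))

val featureTable = avgValue.join(recency, Seq("customer_id"))
```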
- Feature generation should happen on demand rather than by maintaining a massive central table: generate only what is needed, when it is needed.
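Continuing the earlier sketch, on-demand generation can be as simple as keeping the declarations in a registry and materialising only the ones a project asks for; the registry and the `featuresFor` helper are illustrative, not the talk's actual API.

```scala
import org.apache.spark.sql.DataFrame

// Registry of every declared feature, keyed by name (Feature, totalSpend30d and
// txnCount30d come from the earlier declarative-feature sketch).
val registry: Map[String, Feature] =
  Seq(totalSpend30d, txnCount30d).map(f => f.name -> f).toMap

// Materialise only the features a project requests, when it requests them,
// instead of maintaining one massive, always-fresh central table.
def featuresFor(requested: Seq[String], source: DataFrame, key: String): DataFrame =
  buildFeatures(source, key, requested.flatMap(registry.get))

// Example: featuresFor(Seq("txn_count_30d"), spark.table("transactions"), "customer_id")
```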
- There are two main approaches to model serving: data-to-model (sending data to the models) versus model-to-data (sending models to where the data lives). Model-to-data is generally more efficient for large datasets.
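A sketch of the model-to-data direction: serialize a small model, broadcast it to the executors, and score rows where the data already lives. `ScoringModel` is a stand-in for whatever trained artifact is actually shipped, and the feature column names are assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Stand-in for a trained model artifact that is cheap to serialize.
case class ScoringModel(weights: Map[String, Double], bias: Double) extends Serializable {
  def score(features: Map[String, Double]): Double =
    features.map { case (name, value) => weights.getOrElse(name, 0.0) * value }.sum + bias
}

def scoreWhereDataLives(spark: SparkSession, features: DataFrame, model: ScoringModel): DataFrame = {
  // Ship the (small) model to the executors once, rather than shipping the
  // (large) dataset out to a separate model-serving tier.
  val bcModel = spark.sparkContext.broadcast(model)

  val scoreUdf = udf { (spend: Double, txns: Double) =>
    bcModel.value.score(Map("total_spend" -> spend, "txn_count" -> txns))
  }

  features.withColumn("score", scoreUdf(col("total_spend"), col("txn_count")))
}
```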
- Abstractions such as feature transformers helped simplify feature engineering while keeping the performance optimizations under the hood.
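One plausible shape for that abstraction, sketched below: feature logic is exposed as composable transformers, while decisions such as caching stay behind the interface. The trait and class names are invented for illustration.

```scala
import org.apache.spark.sql.DataFrame

// Callers compose transformers without needing to know how each is optimised.
trait FeatureTransformer {
  def transform(df: DataFrame): DataFrame
}

// Example: a transformer that hides a caching decision behind the interface.
class CachedAggregate(agg: DataFrame => DataFrame) extends FeatureTransformer {
  override def transform(df: DataFrame): DataFrame = agg(df).cache()
}

// Transformers compose into a pipeline with plain function composition.
def runPipeline(steps: Seq[FeatureTransformer])(df: DataFrame): DataFrame =
  steps.foldLeft(df)((acc, step) => step.transform(acc))
```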
- Using declarative feature definitions allowed the same logic to work for both batch and streaming use cases.
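A sketch of how a single definition can serve both modes with Structured Streaming: write the feature logic as a pure `DataFrame => DataFrame` function and apply it to a batch read and to a Kafka stream alike. Paths, topic names, and the schema are placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()

// The feature logic is a pure function over a DataFrame, so it does not care
// whether the frame behind it is batch or streaming.
def enrich(df: DataFrame): DataFrame =
  df.withColumn("amount_bucket", when(col("amount") > 100, "high").otherwise("low"))

val txnSchema = StructType(Seq(
  StructField("customer_id", StringType),
  StructField("amount", DoubleType)
))

// Batch: historical backfill from storage.
val batchFeatures = enrich(spark.read.parquet("/data/transactions"))

// Streaming: the same definition over a Kafka topic.
val streamFeatures = enrich(
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
    .select(from_json(col("value").cast("string"), txnSchema).as("txn"))
    .select("txn.*")
)
```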
- Testing and versioning of features is critical; having declarative definitions makes both much easier.
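Because the definitions are plain functions over DataFrames, a feature test can be a tiny in-memory frame with hand-checked expectations. The sketch below reuses the `enrich` definition from the batch/streaming example and plain assertions rather than any particular test framework.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("feature-test").getOrCreate()
import spark.implicits._

// Same declarative definition as in the batch/streaming sketch.
def enrich(df: DataFrame): DataFrame =
  df.withColumn("amount_bucket", when(col("amount") > 100, "high").otherwise("low"))

// A tiny, fully controlled input instead of a production table.
val input = Seq(("c1", 250.0), ("c2", 10.0)).toDF("customer_id", "amount")

val buckets = enrich(input)
  .select("customer_id", "amount_bucket")
  .collect()
  .map(row => row.getString(0) -> row.getString(1))
  .toMap

// Deterministic input gives deterministic, easily versioned expectations.
assert(buckets("c1") == "high")
assert(buckets("c2") == "low")
```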
- Understanding real usage patterns is key: most projects only needed a fraction of the available features and data.
- Modern Spark versions (2.x and later) provide significantly better performance than 1.x for these kinds of workloads.