Rapid Feature Harvesting Using DFS & Data Engineering Techniques • Ananth Gundabattula • YOW! 2019

Discover how to rapidly harvest new features using Deep Feature Synthesis (DFS) and data engineering techniques: automate the costly feature engineering process and generate features in parallel, with Apache Calcite used to improve scalability and efficiency.

Key takeaways

Rapid Feature Harvesting combines DFS (Deep Feature Synthesis) with data engineering techniques to quickly generate new features from large datasets.
Feature engineering is costly, especially on large datasets, which motivates efforts to automate it.
The speaker demonstrates a feature harvesting library that can generate features in parallel, using a graph-based approach.
The library uses a base feature definition to generate new features by applying relationships between columns.
Features can be categorized as direct, aggregation, or join-based, depending on the strategy used to generate them.
The speaker discusses the importance of lineage and metadata in feature engineering, and how the library helps manage them.
He also mentions the challenge of feature explosion, where so many features are generated that it becomes difficult to identify the most relevant ones.
The library uses Apache Calcite to optimize the feature generation process, making it more efficient and scalable.
The speaker discusses the application of the feature harvesting library in various domains, including retail, finance, and healthcare.
He also mentions the potential for applying the technique in other areas, such as streaming data and real-time analytics.
Example features generated include the “max of average transaction amounts across sessions” and the “count of transactions per customer”.
The speaker highlights the potential benefits of the technique, including reduced development time and cost, and improved feature selection.
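
To make the two example features above concrete, here is a minimal Python sketch of how stacked aggregations might be computed. It is not the speaker's library or its API. It assumes transactions are grouped into sessions that belong to customers, and the sample data, field names, and function names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data: (transaction_id, session_id, amount)
transactions = [
    (1, "s1", 10.0),
    (2, "s1", 30.0),
    (3, "s2", 50.0),
    (4, "s3", 5.0),
    (5, "s3", 15.0),
]
# Hypothetical relationship: each session belongs to one customer
session_customer = {"s1": "alice", "s2": "alice", "s3": "bob"}


def count_transactions_per_customer(transactions, session_customer):
    """Aggregation feature: number of transactions per customer."""
    counts = defaultdict(int)
    for _txn_id, session_id, _amount in transactions:
        counts[session_customer[session_id]] += 1
    return dict(counts)


def max_of_session_means(transactions, session_customer):
    """Stacked aggregation: max over sessions of the mean transaction amount."""
    amounts_by_session = defaultdict(list)
    for _txn_id, session_id, amount in transactions:
        amounts_by_session[session_id].append(amount)
    # Depth 1: mean transaction amount per session.
    session_mean = {s: mean(a) for s, a in amounts_by_session.items()}
    # Depth 2: max of those per-session means for each customer.
    result = {}
    for session_id, avg in session_mean.items():
        customer = session_customer[session_id]
        result[customer] = max(result.get(customer, avg), avg)
    return result


txn_counts = count_transactions_per_customer(transactions, session_customer)
max_avg = max_of_session_means(transactions, session_customer)
print(txn_counts)
print(max_avg)
```

Each extra level of relationship adds another aggregation on top of the previous one. This stacking is what lets a small set of base definitions expand into many candidate features, and it is also why feature explosion becomes a concern.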
