Rapid Feature Harvesting Using DFS & Data Engineering Techniques • Ananth Gundabattula • YOW! 2019
Discover how to rapidly harvest new features using Deep Feature Synthesis (DFS) and data engineering techniques, automating the costly feature engineering process and generating features in parallel, with Apache Calcite used for improved scalability and efficiency.
- Rapid Feature Harvesting is a technique that combines DFS (Deep Feature Synthesis) and data engineering to quickly generate new features from large datasets.
- Feature engineering is a costly process, especially for large datasets, and there are efforts to automate it.
- The speaker demonstrates a feature harvesting library that can generate features in parallel, using a graph-based approach.
- The library uses a base feature definition to generate new features by applying relationships between columns.
- Features can be categorized as direct, aggregation, or join-based, depending on the strategy used to generate them.
- The speaker discusses the importance of lineage and metadata in feature engineering, and how the library helps manage both.
- He also mentions the challenge of feature explosion, where too many features are generated, making it difficult to identify the most relevant ones.
- The library uses Apache Calcite to optimize the feature generation process, making it more efficient and scalable.
- The speaker discusses the application of the feature harvesting library in various domains, including retail, finance, and healthcare.
- He also mentions the potential for applying the technique in other areas, such as streaming data and real-time analytics.
- Example features generated include the “max of average transaction amounts across sections” and the “count of transactions per customer”.
- The speaker highlights the potential benefits of the technique, including reduced development time and cost, and improved feature selection.
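
The aggregation strategy and the two example features mentioned above can be illustrated with a minimal sketch. This is not the speaker's library and does not use Apache Calcite; it is plain Python over a made-up transaction list, and the names `TRANSACTIONS` and `aggregate` are hypothetical. It only shows how a second aggregation can be stacked on the output of a first one to produce a deeper feature.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data: (customer_id, section, amount).
TRANSACTIONS = [
    ("c1", "grocery", 10.0),
    ("c1", "grocery", 30.0),
    ("c1", "electronics", 100.0),
    ("c2", "grocery", 5.0),
    ("c2", "toys", 15.0),
    ("c2", "toys", 25.0),
]


def aggregate(rows, key, value, fn):
    """Group rows by key(row), collect value(row) per group, then apply fn."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(value(row))
    return {k: fn(v) for k, v in groups.items()}


# Single aggregation: count of transactions per customer.
count_per_customer = aggregate(
    TRANSACTIONS, key=lambda r: r[0], value=lambda r: r[2], fn=len
)

# First level of a stacked feature: average amount per (customer, section).
avg_per_customer_section = aggregate(
    TRANSACTIONS, key=lambda r: (r[0], r[1]), value=lambda r: r[2], fn=mean
)

# Second level: for each customer, the max of those per-section averages.
max_avg_per_customer = aggregate(
    avg_per_customer_section.items(),
    key=lambda kv: kv[0][0],   # customer_id from the (customer, section) key
    value=lambda kv: kv[1],    # the per-section average
    fn=max,
)

print(count_per_customer)
print(max_avg_per_customer)
```

Because each level consumes only the output of the previous one, independent features such as the two above could in principle be computed in parallel, which is the property the talk attributes to its graph-based approach.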