Rapid Feature Harvesting Using DFS & Data Engineering Techniques • Ananth Gundabattula • YOW! 2019

Discover how to rapidly harvest new features using Deep Feature Synthesis (DFS) and data engineering techniques, automating the costly feature engineering process and generating features in parallel, with Apache Calcite optimizing generation for better scalability and efficiency.

Key takeaways
  • Rapid Feature Harvesting is a technique that combines DFS (Deep Feature Synthesis, not depth-first search) and data engineering to quickly generate new features from large datasets.
  • Feature engineering is a costly process, especially for large datasets, and there are efforts to automate it.
  • The speaker demonstrates a feature harvesting library that can generate features in parallel, using a graph-based approach.
  • The library uses a base feature definition to generate new features by applying relationships between columns.
  • Features can be categorized as direct, aggregation, or join-based, depending on the strategy used to generate them.
  • The speaker discusses the importance of lineage and metadata in feature engineering, and how the library supports tracking both.
  • He also mentions the challenge of feature explosion, where too many features are generated, making it difficult to identify the most relevant ones.
  • The library uses Apache Calcite to optimize the feature generation process, making it more efficient and scalable.
  • The speaker discusses the application of the feature harvesting library in various domains, including retail, finance, and healthcare.
  • He also mentions the potential for applying the technique in other areas, such as streaming data and real-time analytics.
  • Example features generated include the “max of average transaction amounts across sessions” and the “count of transactions per customer”.
  • The speaker highlights the potential benefits of the technique, including reduced development time and cost, and improved feature selection.
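
The aggregation strategy in the takeaways can be sketched in a few lines. This is a minimal illustration in plain Python, not the speaker's library or its API: the toy data, the transaction-to-session-to-customer hierarchy, and the `aggregate` helper are all assumptions made for the example. It shows how aggregation features stack, producing a "max of average transaction amounts" and a "count of transactions per customer".

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: each transaction belongs to a session,
# and each session belongs to a customer.
transactions = [
    {"id": 1, "session_id": "s1", "amount": 10.0},
    {"id": 2, "session_id": "s1", "amount": 30.0},
    {"id": 3, "session_id": "s2", "amount": 50.0},
    {"id": 4, "session_id": "s3", "amount": 5.0},
    {"id": 5, "session_id": "s3", "amount": 15.0},
]
sessions = [
    {"id": "s1", "customer_id": "c1"},
    {"id": "s2", "customer_id": "c1"},
    {"id": "s3", "customer_id": "c2"},
]


def aggregate(children, parent_key, column, func):
    """Aggregation feature: group child rows by their parent key,
    then reduce one column of each group with func."""
    groups = defaultdict(list)
    for row in children:
        groups[row[parent_key]].append(row[column])
    return {key: func(values) for key, values in groups.items()}


# Depth 1: roll transactions up to the session level.
mean_amount = aggregate(transactions, "session_id", "amount", mean)
txn_count = aggregate(transactions, "session_id", "id", len)

# Attach the new features to each session row.
session_features = [
    {**s, "mean_amount": mean_amount[s["id"]], "txn_count": txn_count[s["id"]]}
    for s in sessions
]

# Depth 2: roll the session-level features up to the customer level.
customer_features = {
    "max_mean_amount": aggregate(session_features, "customer_id", "mean_amount", max),
    "txn_count": aggregate(session_features, "customer_id", "txn_count", sum),
}

print(customer_features)
```

Every extra aggregation function or level of depth multiplies the number of candidate features, which is where the feature explosion mentioned above comes from.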