Cainã Max Couto da Silva - Intro to ML: How to Prevent Data Leakage and Build Efficient Workflows

Learn essential techniques to prevent data leakage in machine learning workflows, build robust preprocessing pipelines, and avoid costly production mistakes.

Key takeaways
  • Data leakage occurs when training data contains information that will not be available at prediction time, letting a model score well in evaluation while failing in real-world scenarios

  • Data leakage can lead to over-optimistic results and multi-million dollar mistakes in production

  • Best practices to prevent data leakage:

    • Maintain a clean separation between training and test sets
    • Never preprocess the entire dataset before splitting it into train/test sets
    • Fit preprocessing steps on the training data only, then reuse the learned parameters on the test data
    • Avoid features that would not be available at prediction time
    • Watch for duplicated records that end up on both sides of a train/test split
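The "split first, then fit preprocessing" rule can be sketched with a standard scaler; the toy data here is illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix; values are illustrative only.
X = np.arange(20, dtype=float).reshape(-1, 1)

# Leaky version (do NOT do this): StandardScaler().fit_transform(X) before
# splitting lets test-set statistics influence the training features.

# Correct: split first, then fit the scaler on the training portion only.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)   # learns mean/std from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuses the training parameters
```

The test set is transformed with the training set's mean and standard deviation, exactly as new data would be at prediction time.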
  • Use pipelines and transformers to properly handle data preprocessing:

    • Learn parameters from the training data only
    • Apply the same transformations to test/validation data
    • Encapsulate all preprocessing steps together
    • Save the preprocessing state for reproducibility
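The transformer pattern above maps directly onto scikit-learn's `fit`/`transform` API; a minimal sketch with a mean imputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [3.0], [np.nan]])
X_test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)                       # learns the training mean (2.0)
X_test_filled = imputer.transform(X_test)  # fills test NaNs with the TRAIN mean
```

The fitted state lives on the transformer object (here, `imputer.statistics_`), which is what makes it possible to persist and reapply the exact same preprocessing later.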
  • Key preprocessing considerations:

    • Handle missing values using training data statistics
    • Scale numerical features based on training data parameters
    • Encode categorical variables using training data categories
    • Apply feature selection on training data only
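For the categorical-encoding point, one common pitfall is a category that appears only in the test set. A sketch using `OneHotEncoder` with `handle_unknown="ignore"`, which encodes unseen categories as all zeros instead of failing:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_colors = np.array([["red"], ["blue"], ["red"]])
test_colors = np.array([["blue"], ["green"]])  # "green" never seen in training

# Categories are learned from the training data only.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_colors)
encoded = enc.transform(test_colors).toarray()  # "green" row becomes all zeros
```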
  • Scikit-learn provides specialized classes for proper ML workflows:

    • Transformers for preprocessing steps
    • Estimators for models
    • Pipelines to chain steps together
    • Column transformers for column-specific operations
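These four building blocks compose naturally; a minimal sketch (the column names and data are hypothetical) chaining a `ColumnTransformer` and an estimator inside a `Pipeline`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy dataset with one numeric and one categorical column.
X = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "city": ["NY", "LA", "NY", "SF", "LA", "SF"],
})
y = [0, 1, 0, 1, 1, 0]

# Column-specific operations: scale numerics, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# fit() fits each transformer on the training data, then fits the model;
# predict() replays the same fitted transformations automatically.
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)
preds = model.predict(X)
```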
  • Cross-validation and proper test set handling are essential for reliable model evaluation
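Cross-validation only stays leak-free if the preprocessing is refit inside each fold; passing the whole pipeline (not pre-transformed data) to `cross_val_score` guarantees that. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)

# The scaler is refit on each fold's training split, so validation folds
# never influence the learned scaling parameters.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
```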

  • Many scientific papers and tutorials promote incorrect practices that introduce data leakage

  • AutoML frameworks and other ML libraries typically use similar pipeline concepts under the hood

  • Save the entire pipeline as a single object for deployment rather than saving transformers separately
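A sketch of pipeline persistence with `joblib` (the file name is arbitrary); one artifact holds the fitted transformers and the model together, so the exact training-time preprocessing is replayed at prediction time:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=50, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Dump and reload the whole fitted pipeline as one object.
joblib.dump(pipe, "model_pipeline.joblib")
loaded = joblib.load("model_pipeline.joblib")
preds = loaded.predict(X[:5])
```

Saving the scaler and model separately invites version skew between the two artifacts; a single pipeline object cannot drift apart.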