Cainã Max Couto da Silva - Intro to ML: How to Prevent Data Leakage and Build Efficient Workflows

Learn essential techniques to prevent data leakage in machine learning workflows, build robust preprocessing pipelines, and avoid costly production mistakes.

Key takeaways
  • Data leakage occurs when training data contains information that will not be available at prediction time, letting a model score well in evaluation while failing in real-world scenarios

  • Data leakage can lead to over-optimistic results and multi-million dollar mistakes in production

  • Best practices to prevent data leakage:

    • Maintain a clean separation between training and test sets
    • Never preprocess the entire dataset before splitting it into train/test sets
    • Fit preprocessing steps on the training data only, then reuse the learned parameters on the test data
    • Avoid features that would not be available at prediction time
    • Watch for duplicated records that end up on both sides of a train/test split
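The "split first, then fit preprocessing" rule can be sketched with a standard scaler; the toy data here is illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix; values are illustrative only.
X = np.arange(20, dtype=float).reshape(-1, 1)

# Leaky version (do NOT do this): StandardScaler().fit_transform(X) before
# splitting lets test-set statistics influence the training features.

# Correct: split first, then fit the scaler on the training portion only.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)   # learns mean/std from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuses the training parameters
```

The test set is transformed with the training set's mean and standard deviation, exactly as new data would be at prediction time.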
  • Use pipelines and transformers to properly handle data preprocessing:

    • Learn parameters from the training data only
    • Apply the same transformations to test/validation data
    • Encapsulate all preprocessing steps together
    • Save the preprocessing state for reproducibility
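The transformer pattern above maps directly onto scikit-learn's `fit`/`transform` API; a minimal sketch with a mean imputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [3.0], [np.nan]])
X_test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)                       # learns the training mean (2.0)
X_test_filled = imputer.transform(X_test)  # fills test NaNs with the TRAIN mean
```

The fitted state lives on the transformer object (here, `imputer.statistics_`), which is what makes it possible to persist and reapply the exact same preprocessing later.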
  • Key preprocessing considerations:

    • Handle missing values using training data statistics
    • Scale numerical features based on training data parameters
    • Encode categorical variables using training data categories
    • Apply feature selection on training data only
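For the categorical-encoding point, one common pitfall is a category that appears only in the test set. A sketch using `OneHotEncoder` with `handle_unknown="ignore"`, which encodes unseen categories as all zeros instead of failing:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_colors = np.array([["red"], ["blue"], ["red"]])
test_colors = np.array([["blue"], ["green"]])  # "green" never seen in training

# Categories are learned from the training data only.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_colors)
encoded = enc.transform(test_colors).toarray()  # "green" row becomes all zeros
```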
  • Scikit-learn provides specialized classes for proper ML workflows:

    • Transformers for preprocessing steps
    • Estimators for models
    • Pipelines to chain steps together
    • Column transformers for column-specific operations
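These four building blocks compose naturally; a minimal sketch (the column names and data are hypothetical) chaining a `ColumnTransformer` and an estimator inside a `Pipeline`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy dataset with one numeric and one categorical column.
X = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "city": ["NY", "LA", "NY", "SF", "LA", "SF"],
})
y = [0, 1, 0, 1, 1, 0]

# Column-specific operations: scale numerics, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# fit() fits each transformer on the training data, then fits the model;
# predict() replays the same fitted transformations automatically.
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)
preds = model.predict(X)
```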
  • Cross-validation and proper test set handling are essential for reliable model evaluation
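Cross-validation only stays leak-free if the preprocessing is refit inside each fold; passing the whole pipeline (not pre-transformed data) to `cross_val_score` guarantees that. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)

# The scaler is refit on each fold's training split, so validation folds
# never influence the learned scaling parameters.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
```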

  • Many scientific papers and tutorials promote incorrect practices that introduce data leakage

  • AutoML frameworks and other ML libraries typically use similar pipeline concepts under the hood

  • Save the entire pipeline as a single object for deployment rather than saving transformers separately
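A sketch of pipeline persistence with `joblib` (the file name is arbitrary); one artifact holds the fitted transformers and the model together, so the exact training-time preprocessing is replayed at prediction time:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=50, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Dump and reload the whole fitted pipeline as one object.
joblib.dump(pipe, "model_pipeline.joblib")
loaded = joblib.load("model_pipeline.joblib")
preds = loaded.predict(X[:5])
```

Saving the scaler and model separately invites version skew between the two artifacts; a single pipeline object cannot drift apart.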