Cainã Max Couto da Silva - Intro to ML: How to Prevent Data Leakage and Build Efficient Workflows
Learn essential techniques to prevent data leakage in machine learning workflows, build robust preprocessing pipelines, and avoid costly production mistakes.
- Data leakage occurs when the training data contains information that will not be available at prediction time, so the model appears to make predictions it could not make in real-world use
- Data leakage can lead to overly optimistic evaluation results and multi-million-dollar mistakes in production
- Best practices to prevent data leakage:
  - Maintain a clean separation between training and test sets
  - Never preprocess the entire dataset before splitting it into train and test sets
  - Fit preprocessing steps on the training data only, then apply the learned parameters to the test data
  - Avoid features that would not be available at prediction time
  - Watch for duplicated records that appear in both the train and test splits
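The split-then-fit order described above can be sketched as follows. This is a minimal illustration on synthetic data, assuming scikit-learn's `train_test_split` and `StandardScaler`. The array shapes and random seed are arbitrary choices, not something taken from the talk.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Leaky variant to avoid: StandardScaler().fit(X) on the full dataset,
# because the test rows would then influence the learned mean and scale.

# 1. Split first, before any preprocessing touches the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Learn the scaling parameters from the training set only
scaler = StandardScaler().fit(X_train)

# 3. Reuse those same parameters for both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

The scaled training set has zero mean by construction. The scaled test set generally does not, which is expected: it is being measured against parameters it had no part in producing.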
- Use pipelines and transformers to handle data preprocessing properly:
  - Learn parameters from training data only
  - Apply the same transformations to test and validation data
  - Encapsulate all preprocessing steps together
  - Save the preprocessing state for reproducibility
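One way to get these properties is scikit-learn's `Pipeline`. The sketch below chains a scaler and a classifier. The choice of `LogisticRegression`, the step names, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, illustrative data: the label depends on the first two features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Preprocessing and model live in one object
pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() fits the scaler and the model on the training data only
pipe.fit(X_train, y_train)

# score() only transforms the test data with the already-fitted scaler
test_accuracy = pipe.score(X_test, y_test)
```

Because the scaler is fitted inside `pipe.fit`, there is no separate step where it could accidentally be fitted on the test set.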
- Key preprocessing considerations:
  - Handle missing values using statistics computed on the training data
  - Scale numerical features using parameters learned from the training data
  - Encode categorical variables using the categories seen in the training data
  - Run feature selection on the training data only
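To make the first and third points concrete, here is a small sketch using `SimpleImputer` and `OneHotEncoder`. The values and city names are invented for illustration. Scaling and feature selection follow the same fit-on-train, transform-on-test pattern.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Tiny hand-made train/test sets (illustrative values only)
age_train = np.array([[20.0], [30.0], [np.nan], [40.0]])
age_test = np.array([[np.nan], [100.0]])

city_train = np.array([["Lisbon"], ["Porto"], ["Lisbon"]])
city_test = np.array([["Porto"], ["Madrid"]])  # "Madrid" never appears in training

# Missing values: the fill value is the training mean (30.0),
# not a mean that also includes the test value 100.0
imputer = SimpleImputer(strategy="mean").fit(age_train)
age_test_filled = imputer.transform(age_test)

# Categories are learned from the training data; with handle_unknown="ignore",
# a category unseen in training is encoded as an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore").fit(city_train)
city_test_encoded = encoder.transform(city_test).toarray()
```

Setting `handle_unknown="ignore"` is a design choice rather than a requirement. The default behavior raises an error on unseen categories, which can be preferable when an unknown value signals a data problem.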
- Scikit-learn provides specialized classes for proper ML workflows:
  - Transformers for preprocessing steps
  - Estimators for models
  - Pipelines to chain steps together
  - Column transformers for column-specific operations
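These four pieces can be combined roughly as shown below. The sketch assumes pandas is available. The dataset, the column names (`age`, `income`, `plan`, `churned`), and the model choice are all hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy training data; column names are made up for illustration
train = pd.DataFrame({
    "age": [22.0, 35.0, None, 58.0, 41.0, 29.0],
    "income": [30.0, 52.0, 47.0, None, 80.0, 38.0],
    "plan": ["basic", "pro", "basic", "pro", "pro", "basic"],
    "churned": [0, 1, 0, 1, 1, 0],
})
X_train, y_train = train.drop(columns="churned"), train["churned"]

# Transformers: numeric columns are imputed, then scaled
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Transformer: the categorical column is one-hot encoded
categorical = OneHotEncoder(handle_unknown="ignore")

# Column transformer: routes each column group to its own preprocessing
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["plan"]),
])

# Pipeline: preprocessing plus the final estimator (the model)
clf = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])
clf.fit(X_train, y_train)

# New rows are passed in raw form, including a missing value and an unseen plan
new_rows = pd.DataFrame({
    "age": [None, 50.0],
    "income": [45.0, 60.0],
    "plan": ["basic", "enterprise"],
})
predictions = clf.predict(new_rows)
```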
- Cross-validation and proper test set handling are essential for reliable model evaluation
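A minimal sketch of this idea, assuming scikit-learn's `cross_val_score` and a pipeline built with `make_pipeline`, is shown below. The data is synthetic, and the fold count and split sizes are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, illustrative data
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Hold out a test set that stays untouched until the very end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Passing the whole pipeline means each fold refits the scaler
# on that fold's training portion only
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Final check on the held-out test set, done once
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)
```

Scaling `X_train` once and then cross-validating only the model would let each validation fold influence the scaler, which is a subtler form of the same leakage.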
- Many scientific papers and tutorials promote incorrect practices that introduce data leakage
- AutoML frameworks and other ML libraries typically use similar pipeline concepts under the hood
- Save the entire pipeline as a single object for deployment rather than saving each transformer separately
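As a sketch of what saving a single object can look like, the example below persists a fitted pipeline with `joblib` and reloads it. The file name and temporary directory are hypothetical. `joblib` is one common option, not necessarily the one used in the talk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, illustrative training data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# One file holds the fitted scaler and the fitted model together
path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")  # hypothetical file name
joblib.dump(pipe, path)

# At serving time: load once and pass raw features straight in
loaded = joblib.load(path)
same_predictions = np.array_equal(loaded.predict(X), pipe.predict(X))
```

Files written this way should only be loaded from trusted sources and with compatible library versions, since they rely on Python's pickle mechanism.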