Cainã Max Couto da Silva - Intro to ML: How to Prevent Data Leakage and Build Efficient Workflows
Learn essential techniques to prevent data leakage in machine learning workflows, build robust preprocessing pipelines, and avoid costly production mistakes.
- Data leakage occurs when the training data contains information that will not be available at prediction time, so the model appears to make predictions it cannot make in real-world scenarios
- Data leakage can lead to over-optimistic results and multimillion-dollar mistakes in production
- Best practices to prevent data leakage:
    - Maintain a clean separation between training and test sets
    - Never preprocess the entire dataset before splitting it into train and test sets
    - Fit preprocessing steps on the training data only, then apply the learned parameters to the test data
    - Avoid using features that would not be available at prediction time
    - Watch for duplicated records that end up on both sides of the train/test split
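
The split-then-fit order above can be sketched as follows. This is a minimal illustration on synthetic data, not code from the talk; the array shapes, the 25% test size, and the choice of `StandardScaler` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
y = rng.integers(0, 2, size=200)

# 1. Split first, before any statistics are computed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Learn the scaling parameters (mean, std) from the training rows only.
scaler = StandardScaler().fit(X_train)

# 3. Reuse those training parameters on both splits.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Calling `fit` (or `fit_transform`) on the full `X` before splitting would let test-set statistics influence the scaler, which is the leak the list warns about.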
 
- Use pipelines and transformers to properly handle data preprocessing:
    - Learn parameters from training data only
    - Apply the same transformations to test/validation data
    - Encapsulate all preprocessing steps together
    - Save preprocessing state for reproducibility
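
One way to get these properties in scikit-learn is a `Pipeline`. The sketch below assumes a synthetic classification dataset and a scaler followed by logistic regression; the step names `"scale"` and `"model"` are arbitrary labels chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed for illustration only.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Preprocessing and model are encapsulated in a single object.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaler parameters and the model from training data only.
pipe.fit(X_train, y_train)

# score() applies the already-fitted scaler to the test data, then predicts.
test_accuracy = pipe.score(X_test, y_test)
```

Because the fitted scaler lives inside `pipe`, the same learned parameters are reused wherever the pipeline is applied.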
 
- Key preprocessing considerations:
    - Handle missing values using training data statistics
    - Scale numerical features based on training data parameters
    - Encode categorical variables using training data categories
    - Apply feature selection on training data only
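
To make the imputation and encoding points concrete, here is a small sketch with made-up `age` and `city` columns (both hypothetical). The imputer and encoder are fitted on the training frame only, and the test frame contains a missing value and a category never seen in training.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy frames, assumed for illustration only.
train = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "city": ["Rio", "Recife", "Rio", "Natal"],
})
test = pd.DataFrame({
    "age": [np.nan, 90.0],
    "city": ["Rio", "Manaus"],  # "Manaus" never appears in training
})

# Statistics and categories are learned from the training frame only.
imputer = SimpleImputer(strategy="mean").fit(train[["age"]])
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[["city"]])

# The missing test age is filled with the training mean (30.0), not a test statistic.
test_age = imputer.transform(test[["age"]])

# Columns follow the training categories; the unseen city becomes an all-zero row.
test_city = encoder.transform(test[["city"]]).toarray()
```

Setting `handle_unknown="ignore"` is one possible policy for unseen categories; the default would raise an error instead.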
 
- Scikit-learn provides specialized classes for proper ML workflows:
    - Transformers for preprocessing steps
    - Estimators for models
    - Pipelines to chain steps together
    - Column transformers for column-specific operations
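
A rough sketch of how these four pieces can fit together, using a tiny hypothetical dataset (the `income`, `segment`, and `churned` columns are invented for the example, and the first six rows are simply treated as the training set):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, assumed for illustration only.
df = pd.DataFrame({
    "income": [30.0, 45.0, np.nan, 80.0, 52.0, 61.0, 38.0, 75.0],
    "segment": ["a", "b", "a", "c", "b", "a", "c", "b"],
    "churned": [0, 0, 1, 1, 0, 1, 0, 1],
})
X, y = df[["income", "segment"]], df["churned"]
X_train, X_test = X.iloc[:6], X.iloc[6:]
y_train, y_test = y.iloc[:6], y.iloc[6:]

# Transformers: a small pipeline for numeric columns, an encoder for categorical ones.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = OneHotEncoder(handle_unknown="ignore")

# Column transformer: routes each column to its own preprocessing.
preprocess = ColumnTransformer([
    ("num", numeric, ["income"]),
    ("cat", categorical, ["segment"]),
])

# Pipeline: chains preprocessing with the estimator.
clf = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```

The whole object behaves like a single estimator, so it can be passed to cross-validation or saved as one unit.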
 
- Cross-validation and proper test set handling are essential for reliable model evaluation
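
A common way to see why this matters is to run feature selection on pure noise. The sketch below is an illustration, not an experiment from the talk: the labels are random, so no model should beat chance by much. Selecting features on the full dataset before cross-validation leaks label information into every fold, while putting the selector inside a `Pipeline` re-fits it on each training fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise: features and labels are unrelated (assumed setup for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))
y = rng.integers(0, 2, size=100)

# Leaky: features are chosen using every label, including the validation folds.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Leak-free: the selector is re-fitted inside each training fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky_score:.2f}")
print(f"honest CV accuracy: {honest_score:.2f}")
```

The leaky estimate should come out well above the leak-free one, even though there is nothing to learn in this data.
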
- Many scientific papers and tutorials promote incorrect practices that introduce data leakage
- AutoML frameworks and other ML libraries typically use similar pipeline concepts under the hood
- Save the entire pipeline as a single object for deployment rather than saving each transformer separately
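
A minimal sketch of saving and restoring a fitted pipeline as one object, assuming `joblib` for persistence (one common choice; the talk summary does not name a specific tool) and a temporary directory in place of a real deployment path:

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed for illustration only.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pipeline.joblib"  # hypothetical file name
    joblib.dump(pipe, path)               # scaler and model saved together
    restored = joblib.load(path)

# The restored object carries the fitted preprocessing and the model.
same_predictions = np.array_equal(pipe.predict(X), restored.predict(X))
```

Loading a single file avoids the risk of pairing a model with a scaler or encoder fitted on different data.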