That’s it?! Dealing with unexpected data problems [PyCon DE & PyData Berlin 2024]
Learn systematic approaches for handling unexpected data issues, from basic fixes to complex solutions, with practical tips for documentation, working with experts, and knowing when to pivot.
- Be systematic and structured when dealing with data problems: start with the simplest fixes before escalating to more complex solutions
- Document everything: create searchable documentation of lessons learned, failed approaches, and hidden data logic so that others do not repeat the same mistakes
- Leverage domain experts early and often: they understand data generation processes, hidden assumptions, and daily usage patterns that may not be obvious
- Consider scaling back project scope: a working model on a subset of data is better than no model at all
- Explore standard data engineering fixes first:
  - Data normalization
  - Outlier detection
  - Imputation
  - Type conversions
  - Deduplication
- Don’t overlook data collection as a solution: sometimes gathering new, clean data is more efficient than trying to fix bad data
- Look for natural experiments or smaller datasets that might provide meaningful insights rather than forcing large, problematic datasets to work
- Consider simpler models or more restricted approaches when data quality is poor: stating assumptions and constraints explicitly can help make up for what the data alone cannot support
- Use failed projects as leverage to push for better data governance and quality in your organization
- Know when to stop: if fixes aren’t working after systematic attempts, be willing to admit the project isn’t viable and document why
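
The standard data engineering fixes listed above (type conversion, deduplication, imputation, outlier detection, normalization) can be sketched in a few lines of plain Python. This is a minimal illustration and not code from the talk. The sample records, the field names `sensor_id` and `reading`, the median imputation, the MAD-based outlier rule with its 3.5 cutoff, and the order of the steps are all assumptions chosen for the example.

```python
from statistics import median

# Hypothetical raw records (not from the talk): readings arrive as strings,
# with one exact duplicate, one missing value, and one implausible spike.
RAW_ROWS = [
    {"sensor_id": "a1", "reading": "10.5"},
    {"sensor_id": "a1", "reading": "10.5"},
    {"sensor_id": "b2", "reading": ""},
    {"sensor_id": "c3", "reading": "11.0"},
    {"sensor_id": "d4", "reading": "10.0"},
    {"sensor_id": "e5", "reading": "250.0"},
]


def to_float(text):
    """Type conversion: return a float, or None when the text is not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def deduplicate(rows):
    """Deduplication: keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_median(values):
    """Imputation: replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, cutoff=3.5):
    """Outlier detection: modified z-score based on the median absolute deviation."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # No spread to measure against, so flag nothing.
        return [False] * len(values)
    return [0.6745 * abs(v - center) / mad > cutoff for v in values]


def min_max_normalize(values):
    """Normalization: rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Apply the simplest fixes first, then the more opinionated ones.
rows = deduplicate(RAW_ROWS)
readings = impute_median([to_float(r["reading"]) for r in rows])
outlier_flags = flag_outliers(readings)
clean = [v for v, is_outlier in zip(readings, outlier_flags) if not is_outlier]
scaled = min_max_normalize(clean)

print(readings)
print(outlier_flags)
print(scaled)
```

Each step is a small, separate function so that it can be tried, logged, and reverted on its own, which fits the advice to escalate gradually and document what did and did not work. Whether a flagged value such as the 250.0 spike is a true error or a real event is a question for domain experts, not for the cutoff alone.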