That’s it?! Dealing with unexpected data problems [PyCon DE & PyData Berlin 2024]

Learn systematic approaches for handling unexpected data issues, from basic fixes to complex solutions. With practical tips for documentation, working with experts, and knowing when to pivot.

Key takeaways
  • Be systematic and structured when dealing with data problems - start with the simplest fixes before escalating to more complex solutions

  • Document everything - create searchable documentation of lessons learned, failed approaches, and hidden data logic to prevent others from repeating mistakes

  • Leverage domain experts early and often - they understand data generation processes, hidden assumptions, and daily usage patterns that may not be obvious

  • Consider scaling back project scope - a working model on a subset of data is better than no model at all

  • Explore standard data engineering fixes first:

    • Data normalization
    • Outlier detection
    • Imputation
    • Type conversions
    • Deduplication
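
    The fixes above can be sketched in a few lines of pandas. This is an illustrative example, not code from the talk; the dataset and column names (`price`, `city`) are invented for demonstration.

    ```python
    import pandas as pd

    # Hypothetical messy dataset; columns and values are illustrative only.
    df = pd.DataFrame({
        "price": ["10.5", "12.0", "999.0", None, "11.0", "11.0"],
        "city":  ["Berlin ", "berlin", "Munich", "Berlin ", "berlin", "berlin"],
    })

    # Type conversion: strings to numeric, coercing unparseable values to NaN.
    df["price"] = pd.to_numeric(df["price"], errors="coerce")

    # Normalization: harmonize inconsistent categorical spellings.
    df["city"] = df["city"].str.strip().str.lower()

    # Imputation: fill missing prices with the column median.
    df["price"] = df["price"].fillna(df["price"].median())

    # Outlier detection: keep only values within 1.5 * IQR of the quartiles.
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # Deduplication: drop exact duplicate rows.
    df = df.drop_duplicates()
    ```

    Each step is independent, so you can apply only the ones your data actually needs - and in this order, cheap structural fixes (types, normalization) come before the statistical ones (imputation, outlier filtering).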
  • Don’t overlook data collection as a solution - sometimes gathering new, clean data is more efficient than trying to fix bad data

  • Look for natural experiments or smaller datasets that might provide meaningful insights rather than forcing large, problematic datasets to work

  • Consider simpler models or more restricted approaches when data quality is poor - explicit assumptions and constraints can help
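
    One way to make those assumptions explicit is to replace a complex model with a transparent baseline whose behavior is easy to audit. A minimal sketch (not from the talk; the per-category median strategy and all names are illustrative assumptions):

    ```python
    from collections import defaultdict
    from statistics import median

    # Hypothetical training pairs of (category, value).
    train = [("a", 10.0), ("a", 12.0), ("b", 5.0), ("b", 7.0), ("a", 11.0)]

    # Explicit assumption: within a category, values cluster around a stable
    # central tendency, so a per-category median is a defensible baseline
    # even when the data is too noisy for a richer model.
    groups = defaultdict(list)
    for cat, value in train:
        groups[cat].append(value)

    baseline = {cat: median(vals) for cat, vals in groups.items()}
    overall = median(v for _, v in train)

    def predict(cat):
        # Explicit constraint: unseen categories fall back to the overall median.
        return baseline.get(cat, overall)
    ```

    A baseline like this also gives you a yardstick: if the complex model on the full, messy data cannot beat it, that is evidence for scaling the project back.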

  • Use failed projects as leverage to push for better data governance and quality in your organization

  • Know when to stop - if fixes aren’t working after systematic attempts, be willing to admit the project isn’t viable and document why