That’s it?! Dealing with unexpected data problems [PyCon DE & PyData Berlin 2024]

Learn systematic approaches for handling unexpected data issues, from basic fixes to complex solutions. With practical tips for documentation, working with experts, and knowing when to pivot.

Key takeaways
  • Be systematic and structured when dealing with data problems - start with the simplest fixes before escalating to more complex solutions

  • Document everything - create searchable documentation of lessons learned, failed approaches, and hidden data logic to prevent others from repeating mistakes

  • Leverage domain experts early and often - they understand data generation processes, hidden assumptions, and daily usage patterns that may not be obvious

  • Consider scaling back project scope - a working model on a subset of data is better than no model at all

  • Explore standard data engineering fixes first:

    • Data normalization
    • Outlier detection
    • Imputation
    • Type conversions
    • Deduplication
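
    The fixes above can be sketched in a few lines of pandas. This is an illustrative example, not code from the talk; the dataset and column names (`price`, `city`) are invented for demonstration.

    ```python
    import pandas as pd

    # Hypothetical messy dataset; columns and values are illustrative only.
    df = pd.DataFrame({
        "price": ["10.5", "12.0", "999.0", None, "11.0", "11.0"],
        "city":  ["Berlin ", "berlin", "Munich", "Berlin ", "berlin", "berlin"],
    })

    # Type conversion: strings to numeric, coercing unparseable values to NaN.
    df["price"] = pd.to_numeric(df["price"], errors="coerce")

    # Normalization: harmonize inconsistent categorical spellings.
    df["city"] = df["city"].str.strip().str.lower()

    # Imputation: fill missing prices with the column median.
    df["price"] = df["price"].fillna(df["price"].median())

    # Outlier detection: keep only values within 1.5 * IQR of the quartiles.
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # Deduplication: drop exact duplicate rows.
    df = df.drop_duplicates()
    ```

    Each step is independent, so you can apply only the ones your data actually needs - and in this order, cheap structural fixes (types, normalization) come before the statistical ones (imputation, outlier filtering).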
  • Don’t overlook data collection as a solution - sometimes gathering new, clean data is more efficient than trying to fix bad data

  • Look for natural experiments or smaller datasets that might provide meaningful insights rather than forcing large, problematic datasets to work

  • Consider simpler models or more restricted approaches when data quality is poor - explicit assumptions and constraints can help
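
    One way to make those assumptions explicit is to replace a complex model with a transparent baseline whose behavior is easy to audit. A minimal sketch (not from the talk; the per-category median strategy and all names are illustrative assumptions):

    ```python
    from collections import defaultdict
    from statistics import median

    # Hypothetical training pairs of (category, value).
    train = [("a", 10.0), ("a", 12.0), ("b", 5.0), ("b", 7.0), ("a", 11.0)]

    # Explicit assumption: within a category, values cluster around a stable
    # central tendency, so a per-category median is a defensible baseline
    # even when the data is too noisy for a richer model.
    groups = defaultdict(list)
    for cat, value in train:
        groups[cat].append(value)

    baseline = {cat: median(vals) for cat, vals in groups.items()}
    overall = median(v for _, v in train)

    def predict(cat):
        # Explicit constraint: unseen categories fall back to the overall median.
        return baseline.get(cat, overall)
    ```

    A baseline like this also gives you a yardstick: if the complex model on the full, messy data cannot beat it, that is evidence for scaling the project back.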

  • Use failed projects as leverage to push for better data governance and quality in your organization

  • Know when to stop - if fixes aren’t working after systematic attempts, be willing to admit the project isn’t viable and document why