Cesar Garcia - Improving Open Data Quality using Python | PyData Global 2023

Learn how to improve open data quality with Python and the Great Expectations library, from domain exploration to automated validation checks that ensure accuracy and consistency.

Key takeaways
  • Data quality assessment should start by exploring the data domain before diving into technical validation

  • Great Expectations is a Python library for data validation that allows creating reusable data quality checks through “expectations” that can be documented and tracked
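
A minimal sketch of what an expectation looks like, using the legacy pandas-flavored API (`gx.from_pandas`, removed in GX 1.0 in favor of validator objects); the column names and values here are made up:

```python
import great_expectations as gx
import pandas as pd

# Wrap an ordinary DataFrame so it gains expect_* methods
# (legacy API; in GX 1.0 a validator object plays this role).
df = gx.from_pandas(pd.DataFrame({
    "station_id": [1, 2, 2, 4],        # hypothetical columns
    "pm25": [12.0, None, 8.5, 300.0],
}))

# Each expectation is a named, reusable check that returns a result object
print(df.expect_column_values_to_be_unique("station_id").success)     # False: duplicate id
print(df.expect_column_values_to_not_be_null("pm25").success)         # False: missing reading
print(df.expect_column_values_to_be_between("pm25", 0, 150).success)  # False: 300.0 is an outlier

# Every expectation run so far is recorded and can be re-validated together
print(df.validate().success)
```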

  • Key data quality dimensions to consider include accuracy, completeness, consistency, credibility and currentness

  • When working with open data, don’t blindly trust metadata descriptions; validate the actual data content against the stated requirements

  • Great Expectations works with multiple backends, including Pandas, Apache Spark, SQLite, MySQL and cloud data warehouses
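
For illustration, the “fluent” datasource API (GX 0.16–0.18) registers different backends against the same data context, so one expectations suite can be validated anywhere; the names and connection string below are made up:

```python
import great_expectations as gx

context = gx.get_context()

# In-memory pandas DataFrames
context.sources.add_pandas(name="local_frames")

# A SQLite file via a SQLAlchemy connection string
context.sources.add_sqlite(
    name="open_data_db",
    connection_string="sqlite:///data/open_data.db",
)

# An existing Spark session
context.sources.add_spark(name="cluster_frames")
```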

  • Data validation workflow:

    • Create data context
    • Define data source
    • Create expectations suite
    • Run validation
    • Generate documentation
    • Fix issues identified
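
A minimal end-to-end sketch of that workflow, loosely following the Great Expectations quickstart for the 0.16–0.18 “fluent” API (the file path and names are placeholders):

```python
import great_expectations as gx

# 1. Create a data context (project-level configuration)
context = gx.get_context()

# 2. Define a data source and read a batch of data
validator = context.sources.pandas_default.read_csv("data/open_dataset.csv")

# 3. Build an expectations suite interactively
validator.expect_column_values_to_not_be_null("record_id")
validator.expect_column_values_to_be_unique("record_id")
validator.save_expectation_suite(discard_failed_expectations=False)

# 4. Run validation through a checkpoint
checkpoint = context.add_or_update_checkpoint(
    name="open_data_checkpoint", validator=validator
)
result = checkpoint.run()

# 5. Generate human-readable Data Docs from the results
context.build_data_docs()

# 6. Inspect what failed so the issues can be fixed at the source
print(result.success)
```
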
  • Common data quality issues in open datasets:

    • Inconsistent date formats
    • Missing values
    • Duplicate IDs
    • Incorrect data types
    • Undocumented codes/categories
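
Each of those issues maps onto a built-in expectation; a sketch reusing the validator from the workflow above, with hypothetical column names:

```python
# Inconsistent date formats
validator.expect_column_values_to_match_strftime_format("created_at", "%Y-%m-%d")
# Missing values
validator.expect_column_values_to_not_be_null("value")
# Duplicate IDs
validator.expect_column_values_to_be_unique("record_id")
# Incorrect data types
validator.expect_column_values_to_be_of_type("value", "float64")
# Undocumented codes/categories
validator.expect_column_values_to_be_in_set("status", ["active", "closed", "pending"])
```
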
  • Data quality checks should be automated and reproducible rather than applied as one-off manual fixes
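
For example, the checkpoint defined in the workflow sketch above could gate a CI job or scheduled task (a hedged sketch; the checkpoint name is a placeholder):

```python
import sys

import great_expectations as gx

context = gx.get_context()
checkpoint = context.get_checkpoint("open_data_checkpoint")
result = checkpoint.run()

# A non-zero exit code makes the pipeline fail loudly on bad data
# instead of silently shipping it downstream.
sys.exit(0 if result.success else 1)
```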

  • Documentation of data quality requirements and validation results helps communicate issues to stakeholders

  • Data quality is context-dependent: requirements should align with the intended purpose and use case of the data