Cesar Garcia - Improving Open Data Quality using Python | PyData Global 2023

Learn how to improve open data quality with Python and the Great Expectations library, from domain exploration to automated validation checks that ensure accuracy and consistency.

Key takeaways
  • Data quality assessment should start by exploring the data domain before diving into technical validation

  • Great Expectations is a Python library for data validation that allows creating reusable data quality checks through “expectations” that can be documented and tracked
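
A minimal sketch of what an expectation looks like, using the legacy pandas-flavored API (`gx.from_pandas`, removed in GX 1.0 in favor of validator objects); the column names and values here are made up:

```python
import great_expectations as gx
import pandas as pd

# Wrap an ordinary DataFrame so it gains expect_* methods
# (legacy API; in GX 1.0 a validator object plays this role).
df = gx.from_pandas(pd.DataFrame({
    "station_id": [1, 2, 2, 4],        # hypothetical columns
    "pm25": [12.0, None, 8.5, 300.0],
}))

# Each expectation is a named, reusable check that returns a result object
print(df.expect_column_values_to_be_unique("station_id").success)     # False: duplicate id
print(df.expect_column_values_to_not_be_null("pm25").success)         # False: missing reading
print(df.expect_column_values_to_be_between("pm25", 0, 150).success)  # False: 300.0 is an outlier

# Every expectation run so far is recorded and can be re-validated together
print(df.validate().success)
```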

  • Key data quality dimensions to consider include accuracy, completeness, consistency, credibility and currentness

  • When working with open data, don’t blindly trust metadata descriptions; validate the actual data content against the stated requirements

  • Great Expectations works with multiple backends, including Pandas, Apache Spark, SQLite, MySQL and cloud data warehouses
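
For illustration, the “fluent” datasource API (GX 0.16–0.18) registers different backends against the same data context, so one expectations suite can be validated anywhere; the names and connection string below are made up:

```python
import great_expectations as gx

context = gx.get_context()

# In-memory pandas DataFrames
context.sources.add_pandas(name="local_frames")

# A SQLite file via a SQLAlchemy connection string
context.sources.add_sqlite(
    name="open_data_db",
    connection_string="sqlite:///data/open_data.db",
)

# An existing Spark session
context.sources.add_spark(name="cluster_frames")
```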

  • Data validation workflow:

    • Create data context
    • Define data source
    • Create expectations suite
    • Run validation
    • Generate documentation
    • Fix issues identified
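
A minimal end-to-end sketch of that workflow, loosely following the Great Expectations quickstart for the 0.16–0.18 “fluent” API (the file path and names are placeholders):

```python
import great_expectations as gx

# 1. Create a data context (project-level configuration)
context = gx.get_context()

# 2. Define a data source and read a batch of data
validator = context.sources.pandas_default.read_csv("data/open_dataset.csv")

# 3. Build an expectations suite interactively
validator.expect_column_values_to_not_be_null("record_id")
validator.expect_column_values_to_be_unique("record_id")
validator.save_expectation_suite(discard_failed_expectations=False)

# 4. Run validation through a checkpoint
checkpoint = context.add_or_update_checkpoint(
    name="open_data_checkpoint", validator=validator
)
result = checkpoint.run()

# 5. Generate human-readable Data Docs from the results
context.build_data_docs()

# 6. Inspect what failed so the issues can be fixed at the source
print(result.success)
```
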
  • Common data quality issues in open datasets:

    • Inconsistent date formats
    • Missing values
    • Duplicate IDs
    • Incorrect data types
    • Undocumented codes/categories
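
Each of those issues maps onto a built-in expectation; a sketch reusing the validator from the workflow above, with hypothetical column names:

```python
# Inconsistent date formats
validator.expect_column_values_to_match_strftime_format("created_at", "%Y-%m-%d")
# Missing values
validator.expect_column_values_to_not_be_null("value")
# Duplicate IDs
validator.expect_column_values_to_be_unique("record_id")
# Incorrect data types
validator.expect_column_values_to_be_of_type("value", "float64")
# Undocumented codes/categories
validator.expect_column_values_to_be_in_set("status", ["active", "closed", "pending"])
```
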
  • Data quality checks should be automated and reproducible rather than applied as one-off manual fixes
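
For example, the checkpoint defined in the workflow sketch above could gate a CI job or scheduled task (a hedged sketch; the checkpoint name is a placeholder):

```python
import sys

import great_expectations as gx

context = gx.get_context()
checkpoint = context.get_checkpoint("open_data_checkpoint")
result = checkpoint.run()

# A non-zero exit code makes the pipeline fail loudly on bad data
# instead of silently shipping it downstream.
sys.exit(0 if result.success else 1)
```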

  • Documentation of data quality requirements and validation results helps communicate issues to stakeholders

  • Data quality is context-dependent: requirements should align with the intended purpose and use case of the data