Cesar Garcia - Improving Open Data Quality using Python | PyData Global 2023
Learn how to improve open data quality with Python and the Great Expectations library, from domain exploration to automated validation checks that ensure accuracy and consistency.
- Data quality assessment should start with exploring the data domain before diving into technical validation
- Great Expectations is a Python library for data validation that allows creating reusable data quality checks through “expectations” that can be documented and tracked
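A minimal sketch of what an expectation looks like, assuming the Great Expectations 0.x fluent API (the CSV path and column name are placeholders, not from the talk):

```python
import great_expectations as gx

# An ephemeral data context; gx.get_context() creates one on the fly
# when no existing project configuration is found.
context = gx.get_context()

# Read a local CSV with the built-in default pandas datasource.
validator = context.sources.pandas_default.read_csv("stations.csv")

# An "expectation" is a declarative, reusable check on the data.
result = validator.expect_column_values_to_not_be_null("station_id")
print(result.success)  # True only if the column contains no nulls
```

Because expectations are declared rather than hand-coded, the same check can be saved to a suite, rerun on new data batches, and rendered into documentation.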
- Key data quality dimensions to consider include accuracy, completeness, consistency, credibility and currentness
- When working with open data, don’t trust metadata descriptions blindly: validate the actual data content against the stated requirements
- Great Expectations works with multiple backends, including pandas, Apache Spark, SQLite, MySQL and cloud data warehouses
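As a sketch of the multi-backend idea, the same expectation methods can run against a pandas DataFrame or a SQL table; the SQLite file, table, and suite names below are hypothetical, and the fluent datasource calls follow the 0.x API:

```python
import great_expectations as gx

context = gx.get_context()

# pandas backend: validate a CSV loaded into a DataFrame.
pandas_validator = context.sources.pandas_default.read_csv("stations.csv")

# SQL backend: register a SQLite database and point at one of its tables.
sql_source = context.sources.add_sqlite(
    name="open_data_db", connection_string="sqlite:///open_data.db"
)
asset = sql_source.add_table_asset(name="stations", table_name="stations")
sql_validator = context.get_validator(
    batch_request=asset.build_batch_request(),
    create_expectation_suite_with_name="stations_suite",
)

# The identical check executes on two different engines.
pandas_validator.expect_column_values_to_be_unique("station_id")
sql_validator.expect_column_values_to_be_unique("station_id")
```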
- Data validation workflow (a runnable sketch follows this list):
  - Create a data context
  - Define a data source
  - Create an expectation suite
  - Run validation
  - Generate documentation
  - Fix the issues identified
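A compact end-to-end sketch of that workflow under the same 0.x fluent-API assumption (file, suite, and checkpoint names are placeholders):

```python
import great_expectations as gx

# 1. Create a data context (project configuration and stores live here).
context = gx.get_context()

# 2. Define a data source: the default pandas source reading a local CSV.
validator = context.sources.pandas_default.read_csv("stations.csv")

# 3. Build an expectation suite by declaring checks, then save it.
validator.expect_column_values_to_not_be_null("station_id")
validator.expect_column_values_to_be_unique("station_id")
validator.save_expectation_suite(discard_failed_expectations=False)

# 4. Run validation through a named, reusable checkpoint.
checkpoint = context.add_or_update_checkpoint(
    name="stations_checkpoint", validator=validator
)
result = checkpoint.run()

# 5. Generate human-readable Data Docs from the validation results.
context.build_data_docs()

# 6. Inspect failures and go fix the underlying data issues.
print(result.success)
```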
- Common data quality issues in open datasets (see the expectation sketch after this list):
  - Inconsistent date formats
  - Missing values
  - Duplicate IDs
  - Incorrect data types
  - Undocumented codes/categories
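Each of those issues maps onto a built-in expectation; the column names, date format, and allowed code list below are assumptions for illustration:

```python
import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("stations.csv")

# Inconsistent date formats: enforce a single strftime pattern.
validator.expect_column_values_to_match_strftime_format(
    "installation_date", "%Y-%m-%d"
)

# Missing values: require the column to be fully populated.
validator.expect_column_values_to_not_be_null("station_id")

# Duplicate IDs: every identifier must appear exactly once.
validator.expect_column_values_to_be_unique("station_id")

# Incorrect data types: pin the dtype expected by the pandas backend.
validator.expect_column_values_to_be_of_type("capacity", "int64")

# Undocumented codes/categories: only accept values from a known code list.
validator.expect_column_values_to_be_in_set("status", ["active", "retired"])
```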
- Data quality checks should be automated and reproducible rather than one-off manual fixes
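One way to make the checks rerunnable rather than one-off, assuming the checkpoint from the workflow sketch above was saved in a file-backed project context:

```python
import great_expectations as gx

# In a scheduled job or CI pipeline: reload the project context and
# rerun the saved checkpoint by name; no manual steps involved.
context = gx.get_context()
result = context.run_checkpoint(checkpoint_name="stations_checkpoint")

# Fail the pipeline loudly if any expectation was violated.
assert result.success, "Data quality checks failed"
```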
- Documentation of data quality requirements and validation results helps communicate issues to stakeholders
- Data quality is context-dependent: requirements should align with the intended purpose and use case of the data