Talks - Amitosh Swain: Testing Data Pipelines
Learn practical strategies for testing data pipelines, from end-to-end tests to production monitoring. Get tips on test data sampling, tool selection, and balancing coverage vs cost.
- Data pipeline testing can be broken down into end-to-end testing, functional testing, unit testing, and production data quality checks
- End-to-end testing should be the first priority when starting with testing - it provides broad coverage but is slower and more expensive to run
- Separate orchestration code from pipeline logic to enable effective testing - avoid coupling pipeline code tightly with orchestrator frameworks like Airflow
- Snapshot testing is valuable for data pipelines - capture known good state outputs and compare against new pipeline runs to validate behavior
- Unit tests should focus on small chunks of pipeline logic with mock dependencies - they are fast, cheap and allow testing edge cases
- Production data quality checks are essential - monitor data distributions, validate schemas, check for nulls/duplicates, and verify business rules
- Consider costs when implementing testing - cloud resources and compute time for test runs can get expensive if not managed carefully
- Use sampling techniques for test data rather than full datasets - 5-20 rows with key variations is often sufficient
- Leverage existing tools like Great Expectations, SodaCore, and pandas testing utilities rather than building everything from scratch
- Having a consistent data model and well-documented data dictionary helps enable more effective testing practices
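
The points about separating orchestration from pipeline logic and unit testing small chunks with mock dependencies could look roughly like the sketch below. It is not taken from the talk: `dedupe_latest`, `run_pipeline`, and the `order_id` / `updated_at` fields are hypothetical names, and plain Python lists of dicts stand in for whatever dataframe library the pipeline actually uses.

```python
# Hypothetical pipeline logic: a pure function with no orchestrator imports,
# so it can be tested without Airflow or any cloud resources.
def dedupe_latest(rows):
    """Keep only the most recently updated record per order_id."""
    latest = {}
    for row in rows:
        kept = latest.get(row["order_id"])
        if kept is None or row["updated_at"] > kept["updated_at"]:
            latest[row["order_id"]] = row
    return [latest[key] for key in sorted(latest)]


# Thin wiring layer. An orchestrator task would call this with real I/O
# functions; tests pass in-memory fakes instead.
def run_pipeline(read, write):
    write(dedupe_latest(read()))


def test_keeps_latest_version():
    rows = [
        {"order_id": 1, "updated_at": "2024-01-01", "status": "new"},
        {"order_id": 1, "updated_at": "2024-01-03", "status": "paid"},
        {"order_id": 2, "updated_at": "2024-01-02", "status": "new"},
    ]
    assert dedupe_latest(rows) == [rows[1], rows[2]]


def test_empty_input():
    # Edge case that is cheap to cover at the unit level.
    assert dedupe_latest([]) == []


def test_run_pipeline_with_fakes():
    sink = []
    source = [{"order_id": 7, "updated_at": "2024-01-01"}]
    run_pipeline(read=lambda: source, write=sink.extend)
    assert sink == source
```

Because the orchestrator only ever calls `run_pipeline`, the same tests keep working if the scheduling framework changes. The test functions follow the `test_` naming convention that pytest collects.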
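
Snapshot testing, as described in the list above, might be sketched as follows. This is an illustration under assumptions rather than the speaker's implementation: `daily_totals` is a hypothetical transformation, `matches_snapshot` is a hypothetical helper, and the choice to record the snapshot automatically on the first run is one possible convention among several.

```python
import json
import tempfile
from pathlib import Path


def daily_totals(rows):
    """Hypothetical pipeline step: sum amounts per day."""
    totals = {}
    for row in rows:
        totals[row["day"]] = totals.get(row["day"], 0) + row["amount"]
    return [{"day": day, "total": total} for day, total in sorted(totals.items())]


def matches_snapshot(name, actual, snapshot_dir, update=False):
    """Compare output with the stored known-good snapshot.

    On the first run (or when update=True) the output is recorded and the
    check passes; later runs must serialize to exactly the same JSON.
    """
    path = Path(snapshot_dir) / f"{name}.json"
    serialized = json.dumps(actual, indent=2, sort_keys=True)
    if update or not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(serialized)
        return True
    return path.read_text() == serialized


sample = [
    {"day": "2024-01-01", "amount": 10},
    {"day": "2024-01-01", "amount": 5},
    {"day": "2024-01-02", "amount": 7},
]

# A temporary directory keeps the demo self-contained; a real project would
# normally commit the snapshot files next to the tests.
with tempfile.TemporaryDirectory() as tmp:
    first = matches_snapshot("daily_totals", daily_totals(sample), tmp)
    second = matches_snapshot("daily_totals", daily_totals(sample), tmp)
    extra = sample + [{"day": "2024-01-02", "amount": 1}]
    changed = matches_snapshot("daily_totals", daily_totals(extra), tmp)
```

Here `first` records the snapshot, `second` confirms an unchanged run still matches, and `changed` shows that a different output is flagged. When a change in output is intentional, the snapshot is refreshed with `update=True` and the diff is reviewed like any other code change.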
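
The production data quality checks named in the list (schema validation, nulls, duplicates, business rules) can be approximated with a few lines of plain Python, as in the sketch below. The function name `check_quality`, the column names, and the example rule are all hypothetical; distribution monitoring is left out, and tools such as Great Expectations or SodaCore cover the same ground in a more complete way.

```python
from collections import Counter


def check_quality(rows, required_columns, key, rules):
    """Return a list of human-readable problems; an empty list means clean.

    rules maps a rule name to a predicate that takes one row.
    """
    problems = []
    for index, row in enumerate(rows):
        missing = sorted(set(required_columns) - set(row))
        nulls = sorted(c for c in required_columns if c in row and row[c] is None)
        if missing:
            problems.append(f"row {index}: missing columns {missing}")
        if nulls:
            problems.append(f"row {index}: null values in {nulls}")
        if missing or nulls:
            continue  # business rules assume a structurally valid row
        for name, predicate in rules.items():
            if not predicate(row):
                problems.append(f"row {index}: failed rule '{name}'")

    # Duplicate detection on the declared key column.
    counts = Counter(row.get(key) for row in rows)
    for value, count in counts.items():
        if count > 1:
            problems.append(f"duplicate {key}={value!r} appears {count} times")
    return problems


rows = [
    {"order_id": 1, "amount": 30, "country": "IN"},
    {"order_id": 2, "amount": -5, "country": "US"},
    {"order_id": 2, "amount": 12, "country": "US"},
    {"order_id": 3, "amount": None, "country": "DE"},
]
problems = check_quality(
    rows,
    required_columns={"order_id", "amount", "country"},
    key="order_id",
    rules={"amount_non_negative": lambda r: r["amount"] >= 0},
)
```

For the sample rows, `problems` reports the negative amount, the null amount, and the duplicated `order_id`. In production, a non-empty result would typically raise an alert or block the downstream load, depending on how severe the rule is.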
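
For the advice on sampling test data, one possible approach is to keep a single row for each distinct combination of the columns that drive pipeline behavior, capped at a small limit. The sketch below assumes that interpretation of "key variations"; `sample_variations` and the column names are hypothetical.

```python
def sample_variations(rows, keys, per_group=1, limit=20):
    """Pick up to per_group rows for each distinct combination of keys.

    Stops once limit rows are collected, so the fixture stays small even
    when the source dataset is large.
    """
    seen = {}
    sample = []
    for row in rows:
        group = tuple(row.get(k) for k in keys)
        if seen.get(group, 0) < per_group:
            seen[group] = seen.get(group, 0) + 1
            sample.append(row)
            if len(sample) >= limit:
                break
    return sample


# Synthetic stand-in for a large production extract.
full_dataset = [
    {
        "id": i,
        "country": ["IN", "US", "DE"][i % 3],
        "status": ["paid", "refunded"][i % 2],
    }
    for i in range(100)
]

fixture = sample_variations(full_dataset, keys=("country", "status"))
```

With three countries and two statuses, the 100-row dataset reduces to a fixture with one row per combination. Rare edge cases that never appear in real data, such as nulls or malformed values, still need to be added by hand.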