Ruan Pretorius - How to build a data pipeline without data | PyData Global 2023


Learn to build and test data pipelines using synthetic data with Faker and SQLAlchemy. Explore tools, patterns, and best practices when real data is unavailable or sensitive.

Key takeaways

Synthetic data makes it possible to build and test data pipelines when real data is unavailable, sensitive, or not yet collected.
Key tools demonstrated:
- Faker: Python package for generating synthetic data
- SQLAlchemy: For database schema definition and operations
- Flyway: Version control and migration management for databases
Benefits of synthetic data:
- Speeds up development and testing
- Reduces risk of exposing sensitive data
- Allows control over data distributions and edge cases
- Enables testing before real data is available
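
As a rough illustration of controlling distributions and edge cases, the sketch below uses only the Python standard library. The `make_orders` function, the `amount` field, and the chosen edge values are hypothetical and not taken from the talk.

```python
import random


def make_orders(n, seed=42, edge_case_rate=0.1):
    """Generate n synthetic orders with a controllable share of edge cases."""
    rng = random.Random(seed)  # fixed seed keeps the data set reproducible
    orders = []
    for order_id in range(1, n + 1):
        if rng.random() < edge_case_rate:
            # Deliberately inject values a pipeline should handle or reject.
            amount = rng.choice([0.0, -1.0, 1_000_000.0])
        else:
            # Skewed, always-positive amounts for the "normal" orders.
            amount = round(rng.lognormvariate(3.0, 0.8), 2)
        orders.append({"order_id": order_id, "amount": amount})
    return orders


orders = make_orders(1000)
```

Raising `edge_case_rate` stresses validation logic, while the fixed seed keeps test runs comparable.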
Best practices for synthetic data:
- Define scope and purpose beforehand
- Create only the data that testing requires
- Verify generated data against the real schemas and business rules
- Document the generation process
- Consider how realistic the data needs to be
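
One possible way to verify generated rows against business rules is a small validation function run before the data is loaded. The rules and field names below (`id`, `email`, `age`) are assumptions made for illustration only.

```python
import re

# Deliberately loose pattern: one "@", no whitespace, a dot in the domain.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_customer(row):
    """Return a list of rule violations for one synthetic customer row."""
    errors = []
    if not isinstance(row.get("id"), int) or row["id"] <= 0:
        errors.append("id must be a positive integer")
    if not EMAIL_RE.match(str(row.get("email", ""))):
        errors.append("email is not well formed")
    age = row.get("age")
    if not isinstance(age, int) or not 18 <= age <= 120:
        errors.append("age must be an integer between 18 and 120")
    return errors


good = {"id": 1, "email": "ada@example.com", "age": 36}
bad = {"id": 0, "email": "not-an-email", "age": 36}
good_errors = validate_customer(good)
bad_errors = validate_customer(bad)
```

An empty list means the row passed every check. Anything else can be logged or used to fail a test.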
Faker features:
- Built-in methods for common data types (names, emails, addresses)
- Supports different locales for region-specific data
- Extensible through community providers
- Custom implementations possible
Database workflow:
- Define schema using SQLAlchemy classes
- Generate synthetic data
- Convert to SQL scripts
- Use Flyway to version the scripts and apply them as migrations
- Result: repeatable deployments and collaborative development
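
The workflow above could be sketched roughly as follows, assuming SQLAlchemy 1.4 or later. The `customers` table, the helper names, and the use of the standard-library `random` module in place of Faker are illustrative assumptions. The file names follow Flyway's `V<version>__<description>.sql` convention for versioned migrations.

```python
import random

from sqlalchemy import Column, Integer, String, insert
from sqlalchemy.orm import declarative_base
from sqlalchemy.schema import CreateTable

Base = declarative_base()


class Customer(Base):
    """Hypothetical schema defined as a SQLAlchemy class."""
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False)
    email = Column(String(255), nullable=False)


def synthetic_customers(n, seed=7):
    """Generate n reproducible rows (Faker could supply these values instead)."""
    rng = random.Random(seed)
    first_names = ["Ada", "Grace", "Linus", "Guido"]
    rows = []
    for i in range(1, n + 1):
        name = rng.choice(first_names)
        rows.append({"id": i, "name": name, "email": f"{name.lower()}{i}@example.com"})
    return rows


def to_sql(statement):
    """Render a statement with values inlined, using the generic default dialect."""
    compiled = statement.compile(compile_kwargs={"literal_binds": True})
    return str(compiled).strip() + ";"


def build_migrations(rows):
    """Map Flyway-style file names to SQL script contents."""
    create_sql = str(CreateTable(Customer.__table__)).strip() + ";"
    inserts = "\n".join(to_sql(insert(Customer.__table__).values(**row)) for row in rows)
    return {
        "V1__create_customers.sql": create_sql + "\n",
        "V2__seed_customers.sql": inserts + "\n",
    }


migrations = build_migrations(synthetic_customers(5))
```

Writing each entry of `migrations` to a file in the directory Flyway scans and running `flyway migrate` would then apply them in version order. For a specific database, a dialect can be passed to `compile()` so the generated SQL matches that engine.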
Challenges:
- May not capture all nuances of real data
- Requires maintenance effort
- Limited realism compared to actual data
- Realism has to be balanced against simplicity
