Ruan Pretorius - How to build a data pipeline without data | PyData Global 2023
Learn to build and test data pipelines using synthetic data with Faker and SQLAlchemy. Explore tools, patterns, and best practices when real data is unavailable or sensitive.
Synthetic data provides a way to build and test data pipelines when real data is unavailable, sensitive, or not yet collected
Key tools demonstrated:
- Faker: Python package for generating synthetic data
- SQLAlchemy: For database schema definition and operations
- Flyway: Version control and migration management for databases
Benefits of synthetic data:
- Speeds up development and testing
- Reduces risk of exposing sensitive data
- Allows control over data distributions and edge cases
- Enables testing before real data is available
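As a rough illustration of controlling distributions and edge cases, the sketch below uses only Python's standard `random` module. The status names, weights, and edge-case amounts are hypothetical values chosen for the example, not figures from the talk.

```python
import random

# Hypothetical order statuses and the share of rows each should get.
STATUS_WEIGHTS = {"completed": 0.80, "pending": 0.15, "refunded": 0.05}

# Hand-picked edge cases that real data might only rarely contain.
EDGE_CASE_AMOUNTS = [0.0, -1.0, 9_999_999.99]


def make_orders(n, seed=42):
    """Return n synthetic orders plus one extra order per edge-case amount."""
    rng = random.Random(seed)  # local, seeded generator for repeatable output
    statuses = list(STATUS_WEIGHTS)
    weights = list(STATUS_WEIGHTS.values())
    orders = [
        {
            "order_id": i,
            "status": rng.choices(statuses, weights=weights, k=1)[0],
            "amount": round(rng.uniform(1, 500), 2),
        }
        for i in range(1, n + 1)
    ]
    # Append the edge cases explicitly so tests always exercise them.
    for offset, amount in enumerate(EDGE_CASE_AMOUNTS, start=1):
        orders.append({"order_id": n + offset, "status": "completed", "amount": amount})
    return orders


orders = make_orders(1000)
```

Seeding the generator keeps the data set identical between runs, which makes a failing pipeline test easier to reproduce.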
Best practices for synthetic data:
- Define scope and purpose beforehand
- Only create data that’s necessary for testing
- Verify the generated data against the real schema and business rules
- Document the generation process
- Consider how realistic the data needs to be
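One way to check synthetic rows against a schema and business rules is a small validation function. This is a minimal sketch: the field names and rules below are assumptions made for illustration, and a real project would derive them from the production schema.

```python
from datetime import date

# Hypothetical required fields and types for a "customer" record.
REQUIRED_FIELDS = {"customer_id": int, "email": str, "signup_date": date}


def validate_customer(row):
    """Return a list of rule violations for one synthetic row (empty if valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # Example business rules (assumed): emails contain '@', no future signups.
    if isinstance(row.get("email"), str) and "@" not in row["email"]:
        problems.append("email must contain '@'")
    if isinstance(row.get("signup_date"), date) and row["signup_date"] > date.today():
        problems.append("signup_date cannot be in the future")
    return problems


good = {"customer_id": 1, "email": "a@example.com", "signup_date": date(2023, 1, 5)}
bad = {"customer_id": "1", "email": "not-an-email"}
```

Running the same checks over both generated rows and any available sample of real rows helps show where the synthetic data drifts from the real thing.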
Faker features:
- Built-in methods for common data types (names, emails, addresses)
- Supports different locales for region-specific data
- Extensible through community providers
- Custom implementations possible
Database workflow:
- Define schema using SQLAlchemy classes
- Generate synthetic data
- Convert to SQL scripts
- Use Flyway for version control and migrations
- The result is repeatable deployments and easier collaborative development
Challenges:
- May not capture all nuances of real data
- Requires maintenance effort
- Limited realism compared to actual data
- Need to balance between realism and simplicity