Jakub Kramata - Big Bank Data in Migration: From In-House CSV to Parquet in Amazon S3
Learn how a major bank tackled the challenges of migrating CSV data to AWS S3, from handling corrupted files to building custom verification tools and choosing the right frameworks.
-
The organization migrated its data warehouse from Teradata Aster to AWS cloud storage (S3), using CSV files as the intermediate transfer format
-
Key challenges with CSV files included:
- Inconsistent formatting and separators
- Wrong/corrupted metadata
- Data type mismatches
- Mixed-up columns
- Missing rows
- Decimal precision issues
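
None of the code below comes from the talk; this is a minimal pandas sketch, with a hypothetical schema, path, and delimiter, of the kind of defensive reading these issues tend to force:

```python
import pandas as pd

# Hypothetical schema for one exported table.
EXPECTED_COLUMNS = ["customer_id", "account_id", "balance", "updated_at"]

def load_export(path: str, sep: str = ";") -> pd.DataFrame:
    """Read an export without trusting its formatting or type metadata."""
    df = pd.read_csv(
        path,
        sep=sep,                 # separators varied between exports
        dtype=str,               # read everything as strings; cast only after validation
        keep_default_na=False,   # keep empty strings visible instead of silently mapping to NaN
        on_bad_lines="warn",     # surface rows with the wrong number of fields
    )
    # Catch mixed-up or missing columns before anything downstream runs.
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns in {path}: {list(df.columns)}")
    return df
```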
-
Initially tried using the Pandas framework but switched to Dask due to:
- Better handling of large files (20-30GB+)
- Distributed processing capability
- Similar API to Pandas
- Lazy evaluation benefits
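
As a rough illustration of why Dask fit (lazy evaluation, partitioned reads, a Pandas-like API), here is a sketch with made-up paths and options:

```python
import dask.dataframe as dd

# Lazily scan a set of large exports; no data is read at this point.
# The glob path is illustrative; with s3fs installed the same call
# accepts s3:// URLs directly.
ddf = dd.read_csv(
    "/data/exports/transactions_*.csv",
    sep=";",
    dtype=str,           # avoid type-inference surprises on dirty data
    blocksize="256MB",   # split 20-30GB files into manageable partitions
)

# Work is built up as a task graph; len() triggers the actual read,
# so no single file ever has to fit in memory at once.
print(f"rows: {len(ddf)}")
```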
-
Developed “SumChecker” solution to verify data integrity:
- Calculates row-level hashes in both source and destination
- Compares hashes to detect discrepancies
- Handles primary key columns
- Uses the FNV-1a hashing algorithm for performance
- Validates proper CSV parsing and column counts
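
SumChecker's code itself wasn't shown; the following is a minimal sketch of per-row FNV-1a (64-bit) hashing, with illustrative column names and records:

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3

def fnv1a_64(data: bytes) -> int:
    """FNV-1a (64-bit): a fast, non-cryptographic hash suited to bulk row checks."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def row_hash(row: dict, columns: list[str]) -> int:
    """Hash one record over a fixed column order; matching rows on the
    source and destination side must produce the same value."""
    joined = "|".join(str(row[col]) for col in columns)
    return fnv1a_64(joined.encode("utf-8"))

# Hypothetical comparison keyed by primary key.
source_row = {"id": "42", "amount": "10.50", "name": "Alice"}
target_row = {"id": "42", "amount": "10.50", "name": "Alice"}
columns = ["id", "amount", "name"]
assert row_hash(source_row, columns) == row_hash(target_row, columns)
```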
-
Technical implementation details:
- Converts all data to strings before hashing
- Handles missing values explicitly
- Rounds decimals to 2 places for consistency
- Sorts columns before comparison
- Supports multiple data sources
- Works with both CSV and Parquet formats
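
A possible normalization step consistent with these points (the sentinel value and field layout are assumptions, not taken from the talk):

```python
import math

MISSING = "<NULL>"  # assumed sentinel so missing values are handled explicitly

def normalize_value(value) -> str:
    """Render every value as a canonical string before hashing."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return MISSING
    if isinstance(value, float):
        return f"{round(value, 2):.2f}"  # fix decimals at two places
    return str(value)

def normalize_row(row: dict) -> str:
    """Sort columns so both sides agree on field order regardless of
    how the CSV or Parquet file laid them out."""
    return "|".join(f"{col}={normalize_value(row[col])}" for col in sorted(row))

# The same record read from a CSV source and a Parquet destination
# normalizes to an identical string, so the hashes match.
csv_row = {"amount": 10.499999999, "id": "42", "name": None}
parquet_row = {"id": "42", "name": None, "amount": 10.5}
assert normalize_row(csv_row) == normalize_row(parquet_row)
```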
-
Considered but rejected alternatives:
- Aggregate metrics approach (too resource-intensive)
- Polars framework (too new/unproven at the time)
- Direct database connections