Jakub Kramata - Big Bank Data in Migration: From In-House CSV to Parquet in Amazon S3
Learn how a major bank tackled the challenges of migrating CSV data to AWS S3, from handling corrupted files to building custom verification tools and choosing the right frameworks.
-
The organization migrated its data warehouse from Teradata Aster to AWS cloud storage (S3), using CSV files as the intermediate transfer format
-
Key challenges with CSV files included:
- Inconsistent formatting and separators
- Wrong/corrupted metadata
- Data type mismatches
- Mixed-up columns
- Missing rows
- Decimal precision issues
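
None of the code below comes from the talk; this is a minimal pandas sketch, with a hypothetical schema, path, and delimiter, of the kind of defensive reading these issues tend to force:

```python
import pandas as pd

# Hypothetical schema for one exported table.
EXPECTED_COLUMNS = ["customer_id", "account_id", "balance", "updated_at"]

def load_export(path: str, sep: str = ";") -> pd.DataFrame:
    """Read an export without trusting its formatting or type metadata."""
    df = pd.read_csv(
        path,
        sep=sep,                 # separators varied between exports
        dtype=str,               # read everything as strings; cast only after validation
        keep_default_na=False,   # keep empty strings visible instead of silently mapping to NaN
        on_bad_lines="warn",     # surface rows with the wrong number of fields
    )
    # Catch mixed-up or missing columns before anything downstream runs.
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns in {path}: {list(df.columns)}")
    return df
```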
-
Initially tried using the Pandas framework but switched to Dask due to:
- Better handling of large files (20-30GB+)
- Distributed processing capability
- Similar API to Pandas
- Lazy evaluation benefits
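
As a rough illustration of why Dask fit (lazy evaluation, partitioned reads, a Pandas-like API), here is a sketch with made-up paths and options:

```python
import dask.dataframe as dd

# Lazily scan a set of large exports; no data is read at this point.
# The glob path is illustrative; with s3fs installed the same call
# accepts s3:// URLs directly.
ddf = dd.read_csv(
    "/data/exports/transactions_*.csv",
    sep=";",
    dtype=str,           # avoid type-inference surprises on dirty data
    blocksize="256MB",   # split 20-30GB files into manageable partitions
)

# Work is built up as a task graph; len() triggers the actual read,
# so no single file ever has to fit in memory at once.
print(f"rows: {len(ddf)}")
```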
-
Developed “SumChecker” solution to verify data integrity:
- Calculates row-level hashes in both source and destination
- Compares hashes to detect discrepancies
- Handles primary key columns
- Uses the FNV-1a hashing algorithm for performance
- Validates proper CSV parsing and column counts
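
SumChecker's code itself wasn't shown; the following is a minimal sketch of per-row FNV-1a (64-bit) hashing, with illustrative column names and records:

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3

def fnv1a_64(data: bytes) -> int:
    """FNV-1a (64-bit): a fast, non-cryptographic hash suited to bulk row checks."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def row_hash(row: dict, columns: list[str]) -> int:
    """Hash one record over a fixed column order; matching rows on the
    source and destination side must produce the same value."""
    joined = "|".join(str(row[col]) for col in columns)
    return fnv1a_64(joined.encode("utf-8"))

# Hypothetical comparison keyed by primary key.
source_row = {"id": "42", "amount": "10.50", "name": "Alice"}
target_row = {"id": "42", "amount": "10.50", "name": "Alice"}
columns = ["id", "amount", "name"]
assert row_hash(source_row, columns) == row_hash(target_row, columns)
```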
-
Technical implementation details:
- Converts all data to strings before hashing
- Handles missing values explicitly
- Rounds decimals to 2 places for consistency
- Sorts columns before comparison
- Supports multiple data sources
- Works with both CSV and Parquet formats
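
A possible normalization step consistent with these points (the sentinel value and field layout are assumptions, not taken from the talk):

```python
import math

MISSING = "<NULL>"  # assumed sentinel so missing values are handled explicitly

def normalize_value(value) -> str:
    """Render every value as a canonical string before hashing."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return MISSING
    if isinstance(value, float):
        return f"{round(value, 2):.2f}"  # fix decimals at two places
    return str(value)

def normalize_row(row: dict) -> str:
    """Sort columns so both sides agree on field order regardless of
    how the CSV or Parquet file laid them out."""
    return "|".join(f"{col}={normalize_value(row[col])}" for col in sorted(row))

# The same record read from a CSV source and a Parquet destination
# normalizes to an identical string, so the hashes match.
csv_row = {"amount": 10.499999999, "id": "42", "name": None}
parquet_row = {"id": "42", "name": None, "amount": 10.5}
assert normalize_row(csv_row) == normalize_row(parquet_row)
```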
-
Considered but rejected alternatives:
- Aggregate metrics approach (too resource-intensive)
- Polars framework (too new/unproven at the time)
- Direct database connections