Jakub Kramata - Big Bank Data in Migration: From In-House CSV to Parquet in Amazon S3
Learn how a major bank tackled the challenges of migrating CSV data to AWS S3, from handling corrupted files to building custom verification tools and choosing the right frameworks.
- Organization migrated its data warehouse from Teradata Aster to AWS cloud storage (S3), using CSV files as the intermediate transfer format
- Key challenges with the CSV files included:
  - Inconsistent formatting and separators
  - Wrong/corrupted metadata
  - Data type mismatches
  - Mixed-up columns
  - Missing rows
  - Decimal precision issues
- Initially tried the Pandas framework but switched to Dask (see the sketch after this list) due to:
  - Better handling of large files (20-30 GB+)
  - Distributed processing capability
  - Similar API to Pandas
  - Lazy evaluation benefits
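
A minimal sketch of the kind of Dask usage described above, assuming Python with `dask[dataframe]`, `s3fs`, and `pyarrow` installed; the bucket path, separator, and dtype choices are hypothetical, not the bank's actual setup:

```python
import dask.dataframe as dd

# Lazily point at a set of large CSV exports on S3; nothing is read yet.
# The path, separator, and dtype below are placeholders, not the real schema.
df = dd.read_csv(
    "s3://example-bucket/exports/table_*.csv",
    sep=";",            # pin the separator instead of trusting inconsistent files
    dtype="string",     # read everything as strings to avoid dtype-guessing surprises
    blocksize="256MB",  # split each file into partitions for parallel processing
)

# Pandas-like API, but evaluation is deferred until explicitly requested.
row_count = len(df)                                   # triggers a distributed count
df.to_parquet("s3://example-bucket/parquet/table/")   # write the data out as Parquet
```

Because evaluation is lazy, the multi-gigabyte files are processed partition by partition instead of being loaded into memory at once.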
 
- Developed a “SumChecker” tool (sketched below) to verify data integrity:
  - Calculates row-level hashes in both source and destination
  - Compares hashes to detect discrepancies
  - Handles primary key columns
  - Uses the FNV-1a hashing algorithm for performance
  - Validates proper CSV parsing and column counts
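
SumChecker itself is an internal tool, so the following is only an illustrative sketch of row-level FNV-1a hashing; the column names, value separator, and helper functions are assumptions:

```python
# Illustrative sketch of row-level FNV-1a hashing (not the actual SumChecker code).
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h = ((h ^ byte) * FNV64_PRIME) & MASK64
    return h


def row_hash(row: dict, columns: list[str]) -> int:
    """Hash one row by joining its values in a fixed column order."""
    joined = "|".join(str(row[col]) for col in columns)
    return fnv1a_64(joined.encode("utf-8"))


# The same logical row hashed at source and destination must produce the same value;
# a mismatch on a given primary key points at a corrupted or misparsed row.
columns = ["customer_id", "amount", "created_at"]  # hypothetical schema
row = {"customer_id": "42", "amount": "10.50", "created_at": "2020-01-01"}
print(row_hash(row, columns))
```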
 
- Technical implementation details (see the normalization sketch after this list):
  - Converts all data to strings before hashing
  - Handles missing values explicitly
  - Rounds decimals to 2 places for consistency
  - Sorts columns before comparison
  - Supports multiple data sources
  - Works with both CSV and Parquet formats
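
A hedged sketch of the normalization steps listed above (stringify, explicit missing-value markers, 2-decimal rounding, sorted columns), using pandas for brevity; the `<NULL>` sentinel, column names, and file paths are assumptions:

```python
import pandas as pd


def normalize_for_hashing(df: pd.DataFrame, decimal_cols: list[str]) -> pd.DataFrame:
    """Normalize a frame so the same logical data hashes identically from any source."""
    out = df.copy()

    # Round decimal columns to 2 places so floating-point noise can't break the comparison.
    for col in decimal_cols:
        out[col] = out[col].astype("float64").round(2)

    # Convert every value to a string and make missing values explicit.
    out = out.astype("string").fillna("<NULL>")

    # Sort columns so column-order differences between CSV and Parquet don't matter.
    return out[sorted(out.columns)]


# The same normalization is applied regardless of the source format.
csv_df = pd.read_csv("export.csv")              # hypothetical paths
parquet_df = pd.read_parquet("export.parquet")
left = normalize_for_hashing(csv_df, decimal_cols=["amount"])
right = normalize_for_hashing(parquet_df, decimal_cols=["amount"])
```

Row hashes (for example, the FNV-1a routine sketched earlier) are then computed on the normalized frames and compared per primary key.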
 
- Considered but rejected alternatives:
  - Aggregate metrics approach (too resource-intensive)
  - Polars framework (too new/unproven at the time)
  - Direct database connections