Jakub Kramata - Big Bank Data in Migration: From In-House CSV to Parquet in Amazon S3

Learn how a major bank tackled the challenges of migrating CSV data to AWS S3, from handling corrupted files to building custom verification tooling and choosing the right frameworks.

Key takeaways
  • The organization migrated its data warehouse from Teradata Aster to AWS cloud storage (S3), using CSV files as the intermediate transfer format

  • Key challenges with the CSV files included (see the parsing sketch below):

    • Inconsistent formatting and separators
    • Wrong/corrupted metadata
    • Data type mismatches
    • Mixed-up columns
    • Missing rows
    • Decimal precision issues
  • Initially tried the Pandas framework but switched to Dask (see the Dask sketch below) due to:

    • Better handling of large files (20-30 GB+)
    • Distributed processing capability
    • Similar API to Pandas
    • Lazy evaluation benefits
  • Developed a “SumChecker” solution to verify data integrity (see the hashing sketch below):

    • Calculates row-level hashes in both source and destination
    • Compares hashes to detect discrepancies
    • Handles primary key columns
    • Uses the FNV-1a hashing algorithm for performance
    • Validates proper CSV parsing and column counts
  • Technical implementation details (see the normalization sketch below):

    • Converts all data to strings before hashing
    • Handles missing values explicitly
    • Rounds decimals to 2 places for consistency
    • Sorts columns before comparison
    • Supports multiple data sources
    • Works with both CSV and Parquet formats
  • Considered but rejected alternatives:

    • Aggregate metrics approach (too resource-intensive)
    • Polars framework (too new/unproven at the time)
    • Direct database connections
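
The parsing sketch referenced above: a minimal check, using only Python's standard csv module, that flags rows whose field count does not match the expected schema. The file name, separator, and expected column count are illustrative placeholders, not values from the talk.

```python
import csv

# Illustrative placeholders -- adjust to the actual export.
CSV_PATH = "customers.csv"
SEPARATOR = ";"        # the exports used inconsistent separators, so be explicit
EXPECTED_COLUMNS = 42

def find_malformed_rows(path: str, sep: str, expected_cols: int) -> list[int]:
    """Return line numbers whose field count differs from the expected schema."""
    bad_rows = []
    with open(path, newline="", encoding="utf-8") as handle:
        for line_no, row in enumerate(csv.reader(handle, delimiter=sep), start=1):
            if len(row) != expected_cols:
                bad_rows.append(line_no)
    return bad_rows

if __name__ == "__main__":
    bad = find_malformed_rows(CSV_PATH, SEPARATOR, EXPECTED_COLUMNS)
    print(f"{len(bad)} malformed rows, first few: {bad[:10]}")
```

Catching shape problems this early keeps the later type casting and hashing steps from silently swallowing mixed-up columns or truncated rows.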
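
The Dask sketch referenced above: a rough outline of how a 20-30 GB CSV export can be converted to Parquet on S3 with Dask's lazy, partitioned processing. It assumes dask[dataframe], a Parquet engine such as pyarrow, and s3fs are installed; the paths, separator, and block size are invented for illustration.

```python
import dask.dataframe as dd

SOURCE = "exports/transactions_*.csv"                   # illustrative local glob
TARGET = "s3://example-bucket/warehouse/transactions/"  # illustrative S3 prefix

# read_csv only builds a task graph here (lazy evaluation); no data is loaded yet.
ddf = dd.read_csv(
    SOURCE,
    sep=";",
    dtype=str,          # defer type casting until the data is known to be clean
    blocksize="256MB",  # split large files into partitions workers can hold in memory
)

# The graph executes partition by partition, in parallel, when Parquet is written.
ddf.to_parquet(TARGET, write_index=False)
```

Because each partition is processed independently, the same code can run on a single machine or a Dask cluster, which is where the distributed-processing and Pandas-like-API points above come from.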
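
The hashing sketch referenced above: the talk does not publish SumChecker's code, so this is only an outline of the idea under stated assumptions -- a 64-bit FNV-1a hash over each row's concatenated string values, keyed by a primary key so source and destination can be joined and compared. The file and column names are invented.

```python
import pandas as pd

FNV_OFFSET = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3

def fnv1a(data: bytes) -> str:
    """64-bit FNV-1a hash, returned as a hex string so joins keep it exact."""
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return f"{h:016x}"

def row_hashes(df: pd.DataFrame, primary_key: str) -> pd.Series:
    """Hash every row into one value, indexed by the primary key column."""
    cols = sorted(df.columns)  # column order must not matter
    joined = df[cols].fillna("<NULL>").astype(str).agg("|".join, axis=1)
    return pd.Series(
        [fnv1a(value.encode("utf-8")) for value in joined],
        index=pd.Index(df[primary_key], name=primary_key),
    )

# Illustrative comparison of a CSV source against its Parquet counterpart;
# in practice both sides are normalized first (see the next sketch).
source = pd.read_csv("accounts.csv", sep=";", dtype=str)
target = pd.read_parquet("accounts.parquet")

src = row_hashes(source, "account_id").rename("source")
dst = row_hashes(target, "account_id").rename("target")

report = pd.concat([src, dst], axis=1)  # outer join on the primary key
missing = report["source"].isna() | report["target"].isna()
changed = ~missing & (report["source"] != report["target"])
print(f"{int(missing.sum())} rows missing on one side, {int(changed.sum())} rows differ")
```

Storing a hash per primary key means a mismatch can be traced back to the exact offending row rather than only being flagged at table level.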
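
The normalization sketch referenced above: a rough illustration, with pandas and invented file and column names, of the rules listed in the takeaways -- everything cast to strings, missing values made explicit, decimals rounded to two places, columns sorted -- applied identically whether the frame came from CSV or Parquet.

```python
import pandas as pd

MISSING_MARKER = "<NULL>"  # explicit placeholder so missing values compare equal on both sides

def normalize(df: pd.DataFrame, decimal_columns: list[str]) -> pd.DataFrame:
    """Bring a frame from any source into one canonical, comparable string form."""
    out = df.copy()

    # Round decimals first so 3.1000000001 and 3.1 end up hashing identically.
    for col in decimal_columns:
        out[col] = pd.to_numeric(out[col], errors="coerce").round(2)

    # Cast everything to strings, turning missing values into an explicit marker.
    out = out.astype(str).where(out.notna(), MISSING_MARKER)

    # Sort columns so the export's column order cannot affect the comparison.
    return out[sorted(out.columns)]

# The same function is applied to both formats before hashing.
csv_side = normalize(pd.read_csv("loans.csv", sep=";", dtype=str), decimal_columns=["balance"])
parquet_side = normalize(pd.read_parquet("loans.parquet"), decimal_columns=["balance"])
```

Rounding to two decimal places trades a little sensitivity for consistency: tiny representation differences between the CSV text and the Parquet floats no longer show up as false mismatches.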