Jacob Tomlinson - Accelerating fuzzy document deduplication to improve LLM training w/ RAPIDS & Dask

Learn how to accelerate document deduplication for LLM training using RAPIDS & Dask. Discover GPU-powered solutions that reduce processing time from 37 hours to just 3 hours.

Key takeaways
  • RAPIDS offers GPU-accelerated alternatives to popular PyData libraries like Pandas (cuDF), scikit-learn (cuML), and NetworkX (cuGraph), providing significant performance improvements with little or no code change
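
As a minimal illustration of the mirrored API (the CSV path and column name here are placeholders, not from the talk), cuDF lets familiar Pandas code run on the GPU:

```python
# cuDF mirrors the Pandas API, so this reads a CSV straight into GPU
# memory and runs the groupby there. "documents.csv" and the "language"
# column are placeholder names for this sketch.
import cudf

df = cudf.read_csv("documents.csv")
counts = df.groupby("language").size()   # aggregation executes on the GPU
print(counts.to_pandas())                # copy the small result back to host
```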

  • Using RAPIDS with Dask enables distributed computing across GPU clusters, allowing processing of multi-terabyte datasets that wouldn’t fit in single-machine memory
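
A minimal sketch of a single-machine GPU cluster using dask-cuda; a multi-node deployment would instead pass a remote scheduler address to Client. The Parquet path and the text column are placeholder names:

```python
# One Dask worker per visible GPU on this machine; for a multi-node
# cluster you would connect Client to a remote scheduler instead.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster()
client = Client(cluster)

# Partitioned dataset that can be far larger than any single GPU's memory.
ddf = dask_cudf.read_parquet("documents/")
print(ddf["text"].str.len().mean().compute())
```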

  • The cudf.pandas accelerator acts as a drop-in replacement for Pandas, automatically running operations on the GPU where beneficial and falling back to the CPU when necessary, offering 50-130x speedups
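
In a script, the accelerator is enabled before Pandas is imported; in Jupyter the equivalent is %load_ext cudf.pandas, and on the command line python -m cudf.pandas script.py. The data path below is a placeholder:

```python
# Enable the GPU accelerator first; subsequent unmodified Pandas code
# runs on the GPU where supported and falls back to the CPU elsewhere.
import cudf.pandas
cudf.pandas.install()

import pandas as pd

df = pd.read_csv("documents.csv")        # placeholder path
print(df["text"].str.lower().value_counts().head())
```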

  • For large-scale document deduplication, the workflow combines (a sketch follows this list):

    • MinHash/LSH for initial document grouping
    • Jaccard similarity for comparing document pairs
    • GPU acceleration for text processing and similarity calculations
    • Distributed processing across multiple machines
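
As a CPU-only illustration of that workflow (the talk runs the same scheme on GPUs with RAPIDS and Dask), here is a self-contained sketch; the shingle size, 128 permutations, 32 bands, and 0.7 threshold are illustrative choices, not the talk's exact settings:

```python
import hashlib
from itertools import combinations

def shingles(text, k=5):
    """Character k-grams of a document, as a set."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(shingle_set, num_perm=128):
    """MinHash signature: for each seeded hash function, keep the
    minimum hash value over the document's shingles."""
    return [
        min(int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "little")).digest(),
                "little")
            for s in shingle_set)
        for seed in range(num_perm)
    ]

def lsh_buckets(signatures, bands=32):
    """Band the signatures; documents sharing any band land in the same
    bucket and become candidate pairs (the initial document grouping)."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    return buckets

def jaccard(a, b):
    return len(a & b) / len(a | b)

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "an entirely different sentence about GPUs",
}
sets = {k: shingles(v) for k, v in docs.items()}
sigs = {k: minhash(s) for k, s in sets.items()}

# Only compare the candidate pairs produced by LSH, then confirm each
# pair with exact Jaccard similarity on the shingle sets.
candidates = set()
for bucket in lsh_buckets(sigs).values():
    candidates.update(combinations(sorted(bucket), 2))
for x, y in sorted(candidates):
    sim = jaccard(sets[x], sets[y])
    if sim >= 0.7:
        print(f"near-duplicates: {x}, {y} (Jaccard {sim:.2f})")
```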
  • A real-world example showed processing time dropping from 37 hours on 20 CPU nodes to 3 hours on 4 GPU nodes, demonstrating both performance and cost benefits

  • RAPIDS can be deployed on various platforms (a Kubernetes example follows this list):

    • Cloud providers (AWS, GCP, Azure)
    • Kubernetes clusters
    • Local development environments
    • Managed services like SageMaker and Vertex AI
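
For example, a Kubernetes deployment might look roughly like the following sketch using the dask-kubernetes operator; the cluster name, image tag, worker count, and GPU resource limits are all placeholder values to adapt:

```python
# Hypothetical sketch: a GPU Dask cluster on Kubernetes. Every value
# below (name, image, worker count, GPU limits) is a placeholder.
from dask_kubernetes.operator import KubeCluster
from dask.distributed import Client

cluster = KubeCluster(
    name="rapids-dedup",                            # placeholder name
    image="rapidsai/base:25.04-cuda12.8-py3.12",    # placeholder RAPIDS image tag
    n_workers=4,
    resources={"limits": {"nvidia.com/gpu": "1"}},  # one GPU per worker
)
client = Client(cluster)
```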
  • The combination of RAPIDS and Dask provides the following (illustrated after this list):

    • Lazy evaluation for efficient memory usage
    • Automatic task scheduling and distribution
    • Visual dashboards for monitoring computations
    • Seamless scaling from laptop to cluster
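
A short sketch of the lazy-evaluation and dashboard points, again assuming a local GPU machine; the data path and column name are placeholders:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

client = Client(LocalCUDACluster())
print(client.dashboard_link)                   # live task-stream and memory views

ddf = dask_cudf.read_parquet("docs/")          # lazy: builds a task graph only
ddf = ddf[ddf["text"].str.len() > 0]           # still lazy: the graph just grows
ddf = ddf.persist()                            # schedule work, keep results on GPUs
print(ddf["text"].str.len().mean().compute())  # pull a single value back to host
```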
  • A common LLM data preprocessing pipeline includes the following stages (a skeleton follows this list):

    • Data downloading and extraction
    • Format standardization
    • Text cleaning
    • Deduplication
    • Quality filtering
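
A skeleton of those stages as Dask DataFrame passes; every function body, path, and threshold below is an illustrative placeholder rather than the talk's actual code:

```python
import dask_cudf

def standardize(ddf):
    # Format standardization: reduce varied inputs to one schema,
    # here a single "text" column (placeholder choice).
    return ddf[["text"]]

def clean(ddf):
    # Text cleaning: a trivial normalization example.
    ddf["text"] = ddf["text"].str.lower().str.strip()
    return ddf

def quality_filter(ddf, min_chars=200):
    # Quality filtering: drop very short documents (threshold is illustrative).
    return ddf[ddf["text"].str.len() >= min_chars]

ddf = dask_cudf.read_parquet("extracted/")  # after downloading and extraction
ddf = standardize(ddf)
ddf = clean(ddf)
# Deduplication would run here (MinHash/LSH + Jaccard, as sketched earlier).
ddf = quality_filter(ddf)
ddf.to_parquet("curated/")                  # triggers the whole lazy pipeline
```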