Jacob Tomlinson - Accelerating fuzzy document deduplication to improve LLM training w/ RAPIDS & Dask

Learn how to accelerate document deduplication for LLM training using RAPIDS & Dask. Discover GPU-powered solutions that reduce processing time from 37 hours to just 3 hours.

Key takeaways
  • RAPIDS offers GPU-accelerated alternatives to popular PyData libraries like Pandas (cuDF), scikit-learn (cuML), and NetworkX (cuGraph), providing significant performance improvements with little or no code change
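
As a minimal illustration of the mirrored API (the CSV path and column name here are placeholders, not from the talk), cuDF lets familiar Pandas code run on the GPU:

```python
# cuDF mirrors the Pandas API, so this reads a CSV straight into GPU
# memory and runs the groupby there. "documents.csv" and the "language"
# column are placeholder names for this sketch.
import cudf

df = cudf.read_csv("documents.csv")
counts = df.groupby("language").size()   # aggregation executes on the GPU
print(counts.to_pandas())                # copy the small result back to host
```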

  • Using RAPIDS with Dask enables distributed computing across GPU clusters, allowing processing of multi-terabyte datasets that wouldn’t fit in single-machine memory
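
A minimal sketch of a single-machine GPU cluster using dask-cuda; a multi-node deployment would instead pass a remote scheduler address to Client. The Parquet path and the text column are placeholder names:

```python
# One Dask worker per visible GPU on this machine; for a multi-node
# cluster you would connect Client to a remote scheduler instead.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster()
client = Client(cluster)

# Partitioned dataset that can be far larger than any single GPU's memory.
ddf = dask_cudf.read_parquet("documents/")
print(ddf["text"].str.len().mean().compute())
```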

  • The cudf.pandas accelerator acts as a drop-in replacement for Pandas, automatically running operations on the GPU where beneficial and falling back to the CPU when necessary, offering 50-130x speedups
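
In a script, the accelerator is enabled before Pandas is imported; in Jupyter the equivalent is %load_ext cudf.pandas, and on the command line python -m cudf.pandas script.py. The data path below is a placeholder:

```python
# Enable the GPU accelerator first; subsequent unmodified Pandas code
# runs on the GPU where supported and falls back to the CPU elsewhere.
import cudf.pandas
cudf.pandas.install()

import pandas as pd

df = pd.read_csv("documents.csv")        # placeholder path
print(df["text"].str.lower().value_counts().head())
```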

  • For large-scale document deduplication, the workflow combines (a sketch follows this list):

    • MinHash/LSH for initial document grouping
    • Jaccard similarity for comparing document pairs
    • GPU acceleration for text processing and similarity calculations
    • Distributed processing across multiple machines
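
As a CPU-only illustration of that workflow (the talk runs the same scheme on GPUs with RAPIDS and Dask), here is a self-contained sketch; the shingle size, 128 permutations, 32 bands, and 0.7 threshold are illustrative choices, not the talk's exact settings:

```python
import hashlib
from itertools import combinations

def shingles(text, k=5):
    """Character k-grams of a document, as a set."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(shingle_set, num_perm=128):
    """MinHash signature: for each seeded hash function, keep the
    minimum hash value over the document's shingles."""
    return [
        min(int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "little")).digest(),
                "little")
            for s in shingle_set)
        for seed in range(num_perm)
    ]

def lsh_buckets(signatures, bands=32):
    """Band the signatures; documents sharing any band land in the same
    bucket and become candidate pairs (the initial document grouping)."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    return buckets

def jaccard(a, b):
    return len(a & b) / len(a | b)

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "an entirely different sentence about GPUs",
}
sets = {k: shingles(v) for k, v in docs.items()}
sigs = {k: minhash(s) for k, s in sets.items()}

# Only compare the candidate pairs produced by LSH, then confirm each
# pair with exact Jaccard similarity on the shingle sets.
candidates = set()
for bucket in lsh_buckets(sigs).values():
    candidates.update(combinations(sorted(bucket), 2))
for x, y in sorted(candidates):
    sim = jaccard(sets[x], sets[y])
    if sim >= 0.7:
        print(f"near-duplicates: {x}, {y} (Jaccard {sim:.2f})")
```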
  • A real-world example showed processing time dropping from 37 hours on 20 CPU nodes to 3 hours on 4 GPU nodes, demonstrating both performance and cost benefits

  • RAPIDS can be deployed on various platforms (a Kubernetes example follows this list):

    • Cloud providers (AWS, GCP, Azure)
    • Kubernetes clusters
    • Local development environments
    • Managed services like SageMaker and Vertex AI
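
For example, a Kubernetes deployment might look roughly like the following sketch using the dask-kubernetes operator; the cluster name, image tag, worker count, and GPU resource limits are all placeholder values to adapt:

```python
# Hypothetical sketch: a GPU Dask cluster on Kubernetes. Every value
# below (name, image, worker count, GPU limits) is a placeholder.
from dask_kubernetes.operator import KubeCluster
from dask.distributed import Client

cluster = KubeCluster(
    name="rapids-dedup",                            # placeholder name
    image="rapidsai/base:25.04-cuda12.8-py3.12",    # placeholder RAPIDS image tag
    n_workers=4,
    resources={"limits": {"nvidia.com/gpu": "1"}},  # one GPU per worker
)
client = Client(cluster)
```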
  • The combination of RAPIDS and Dask provides the following (illustrated after this list):

    • Lazy evaluation for efficient memory usage
    • Automatic task scheduling and distribution
    • Visual dashboards for monitoring computations
    • Seamless scaling from laptop to cluster
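
A short sketch of the lazy-evaluation and dashboard points, again assuming a local GPU machine; the data path and column name are placeholders:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

client = Client(LocalCUDACluster())
print(client.dashboard_link)                   # live task-stream and memory views

ddf = dask_cudf.read_parquet("docs/")          # lazy: builds a task graph only
ddf = ddf[ddf["text"].str.len() > 0]           # still lazy: the graph just grows
ddf = ddf.persist()                            # schedule work, keep results on GPUs
print(ddf["text"].str.len().mean().compute())  # pull a single value back to host
```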
  • A common LLM data preprocessing pipeline includes the following stages (a skeleton follows this list):

    • Data downloading and extraction
    • Format standardization
    • Text cleaning
    • Deduplication
    • Quality filtering
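
A skeleton of those stages as Dask DataFrame passes; every function body, path, and threshold below is an illustrative placeholder rather than the talk's actual code:

```python
import dask_cudf

def standardize(ddf):
    # Format standardization: reduce varied inputs to one schema,
    # here a single "text" column (placeholder choice).
    return ddf[["text"]]

def clean(ddf):
    # Text cleaning: a trivial normalization example.
    ddf["text"] = ddf["text"].str.lower().str.strip()
    return ddf

def quality_filter(ddf, min_chars=200):
    # Quality filtering: drop very short documents (threshold is illustrative).
    return ddf[ddf["text"].str.len() >= min_chars]

ddf = dask_cudf.read_parquet("extracted/")  # after downloading and extraction
ddf = standardize(ddf)
ddf = clean(ddf)
# Deduplication would run here (MinHash/LSH + Jaccard, as sketched earlier).
ddf = quality_filter(ddf)
ddf.to_parquet("curated/")                  # triggers the whole lazy pipeline
```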