Jacob Tomlinson - Accelerating fuzzy document deduplication to improve LLM training w/ RAPIDS & Dask
Learn how to accelerate document deduplication for LLM training using RAPIDS & Dask. Discover GPU-powered solutions that reduce processing time from 37 hours to just 3 hours.
- RAPIDS offers GPU-accelerated alternatives to popular PyData libraries such as pandas (cuDF), scikit-learn (cuML), and NetworkX (cuGraph), providing significant performance improvements with minimal code changes, since the APIs mirror their CPU counterparts
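
As a minimal illustration of the mirrored API (the DataFrame contents here are invented for the example):

```python
import cudf

# Same DataFrame/groupby syntax as pandas, executed on the GPU
df = cudf.DataFrame({"category": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})
print(df.groupby("category")["value"].mean())
```
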
- Using RAPIDS with Dask enables distributed computing across GPU clusters, allowing processing of multi-terabyte datasets that wouldn’t fit in single-machine memory
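
A minimal sketch of that pattern, assuming a machine with one or more local GPUs and a hypothetical `documents/*.parquet` dataset:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

# One Dask worker per local GPU; a multi-node cluster drops in the same way
cluster = LocalCUDACluster()
client = Client(cluster)

# Each partition is a cuDF DataFrame held in GPU memory, so the full
# dataset never has to fit on a single machine
ddf = dask_cudf.read_parquet("documents/*.parquet")
print(ddf.map_partitions(len).compute().sum())  # total row count
```
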
- The cudf.pandas library acts as a drop-in replacement for pandas, automatically accelerating operations where beneficial and falling back to CPU when necessary, offering 50-130x speedups
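
Enabling it looks roughly like this (in a notebook, `%load_ext cudf.pandas` does the same; the file and column names are placeholders):

```python
import cudf.pandas
cudf.pandas.install()  # must run before pandas is imported

import pandas as pd  # now transparently backed by cuDF

df = pd.read_csv("data.csv")      # placeholder file
print(df["key"].value_counts())   # runs on the GPU where supported,
                                  # silently falls back to CPU otherwise
```
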
- For large-scale document deduplication, the workflow combines (see the sketch after this list):
  - MinHash/LSH for initial document grouping
  - Jaccard similarity for comparing document pairs
  - GPU acceleration for text processing and similarity calculations
  - Distributed processing across multiple machines
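
A pure-Python sketch of the MinHash/LSH idea; the real pipeline runs GPU-accelerated equivalents of these steps, and the shingle size, permutation count, and band count below are illustrative choices:

```python
import random

NUM_PERM, BANDS = 128, 32
PRIME = (1 << 61) - 1
random.seed(0)
PERMS = [(random.randrange(1, PRIME), random.randrange(PRIME))
         for _ in range(NUM_PERM)]

def shingles(text, k=5):
    """Character k-grams of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(text):
    """Signature = minimum of each permuted hash over all shingles."""
    grams = shingles(text)
    return [min((a * hash(g) + b) % PRIME for g in grams) for a, b in PERMS]

def lsh_buckets(sig):
    """Split the signature into bands; documents sharing any band bucket
    become candidate pairs for the exact Jaccard comparison."""
    rows = NUM_PERM // BANDS
    return [tuple(sig[i * rows:(i + 1) * rows]) for i in range(BANDS)]

def jaccard_estimate(s1, s2):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(s1, s2)) / NUM_PERM

a = minhash("the quick brown fox jumps over the lazy dog")
b = minhash("the quick brown fox jumped over the lazy dog")
print(bool(set(lsh_buckets(a)) & set(lsh_buckets(b))))  # grouped as candidates?
print(jaccard_estimate(a, b))                           # high for near-duplicates
```
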
- A real-world example showed processing time dropping from 37 hours on 20 CPU nodes to 3 hours on 4 GPU nodes, demonstrating both performance and cost benefits
- RAPIDS can be deployed on various platforms (a hedged Kubernetes sketch follows this list):
  - Cloud providers (AWS, GCP, Azure)
  - Kubernetes clusters
  - Local development environments
  - Managed services like SageMaker and Vertex AI
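
As one example, a GPU cluster can be requested on Kubernetes through the Dask operator; the name, image, and worker count below are placeholders, and the GPU resource limit is an assumption about the cluster's node configuration:

```python
from dask.distributed import Client
from dask_kubernetes.operator import KubeCluster

# Placeholder name/image/size; assumes the Dask Kubernetes operator is
# installed and that worker pods can request NVIDIA GPUs
cluster = KubeCluster(
    name="rapids-dedup",
    image="rapidsai/base:latest",  # placeholder RAPIDS image tag
    n_workers=4,
    resources={"limits": {"nvidia.com/gpu": "1"}},
)
client = Client(cluster)
```
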
- The combination of RAPIDS and Dask provides (illustrated in the sketch below):
  - Lazy evaluation for efficient memory usage
  - Automatic task scheduling and distribution
  - Visual dashboards for monitoring computations
  - Seamless scaling from laptop to cluster
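
A short sketch of those behaviours, with a placeholder Parquet path and column name:

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client()             # local cluster by default; scales out unchanged
print(client.dashboard_link)  # live dashboard: task stream, memory, workers

ddf = dd.read_parquet("docs/*.parquet")  # lazy: nothing is read yet
lengths = ddf["text"].str.len()          # still lazy: extends the task graph
print(lengths.mean().compute())          # the scheduler runs the graph now
```
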
- A common LLM data preprocessing pipeline includes (a runnable stub sketch follows):
  - Data downloading and extraction
  - Format standardization
  - Text cleaning
  - Deduplication
  - Quality filtering
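
To make the pipeline shape concrete, a stub sketch; every function here is invented to label a stage, not a real API, and real stages would be Dask/RAPIDS computations over sharded data:

```python
# Invented stubs standing in for each stage of the pipeline above
def download_and_extract(paths):
    return [f"raw text from {p}" for p in paths]   # e.g. web-crawl archives -> text

def standardize_format(docs):
    return [{"text": d} for d in docs]             # one record per document

def clean_text(docs):
    return [{"text": d["text"].strip()} for d in docs]

def deduplicate(docs):
    # Exact-match stand-in for the MinHash/LSH + Jaccard fuzzy dedup above
    return list({d["text"]: d for d in docs}.values())

def quality_filter(docs):
    return [d for d in docs if len(d["text"]) > 10]

def preprocess(raw_paths):
    docs = download_and_extract(raw_paths)
    docs = standardize_format(docs)
    docs = clean_text(docs)
    docs = deduplicate(docs)
    return quality_filter(docs)

print(preprocess(["shard-00.warc"]))
```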