Building Documentation Search with RAG: Lessons from scikit-learn [PyCon DE & PyData Berlin 2024]

Learn best practices for building documentation search with RAG, including chunking strategies, cost optimization, and evaluation. See real examples from scikit-learn.

Key takeaways
  • RAG (Retrieval-Augmented Generation) builds better documentation search systems by combining traditional search with LLMs

  • Chunking strategy is critical for RAG success; different document types require different chunking approaches:

    • API documentation needs parameter-level chunking
    • User guides work better with section-level chunks
    • Avoid cutting sections in the middle (see the chunking sketch after this list)
  • Cost and performance considerations:

    • CPU-only solutions are much slower but cheaper than GPU-based ones
    • BM25 (lexical search) is 100x faster than BERT-based retrieval
    • Consider costs of API calls and GPU resources at scale
  • Multi-stage retrieval improves results:

    • Use BM25 for initial document retrieval
    • Re-rank results using BERT/transformers
    • Pass the top-ranked chunks to the LLM for the final answer (see the pipeline sketch after this list)
  • Local deployment benefits:

    • More control over the stack
    • Lower costs than cloud APIs
    • Better for prototyping and development
  • Evaluation and feedback are challenging for open source projects:

    • Hard to collect user interaction data
    • No clear metrics for success
    • Need to manually check failure cases
  • Important to validate and verify LLM outputs:

    • LLMs can hallucinate incorrect information
    • Context/retrieval quality directly impacts answer quality
    • Consider zero-shot prompting (see the prompt sketch after this list)
  • For scikit-learn specifically:

    • API docs, user guides, and examples each need their own handling
    • Structure from NumPyDoc helps with parsing (see the docstring sketch after this list)
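
Code sketches

The talk did not walk through code, so the sketches below are illustrative only. First, section-level chunking: a hypothetical chunk_by_section helper that splits markdown on headings so no chunk straddles a section boundary, falling back to paragraph breaks for oversized sections (the heading regex and max_chars limit are assumptions).

```python
import re

def chunk_by_section(markdown_text, max_chars=2000):
    """Split a markdown document on headings so that no chunk
    straddles a section boundary (hypothetical helper)."""
    # Split immediately before every heading line (e.g. "## Usage");
    # the lookahead keeps each heading attached to its own section.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown_text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: fall back to paragraph-level splits,
        # still never cutting in the middle of a paragraph.
        buffer = ""
        for paragraph in section.split("\n\n"):
            if buffer and len(buffer) + len(paragraph) > max_chars:
                chunks.append(buffer.strip())
                buffer = ""
            buffer += paragraph + "\n\n"
        if buffer.strip():
            chunks.append(buffer.strip())
    return chunks
```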
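
Next, the multi-stage retrieval idea: BM25 for a cheap lexical first pass, a transformer cross-encoder to re-rank the few survivors, and the winners joined into LLM context. A minimal sketch assuming the rank-bm25 and sentence-transformers packages; the toy corpus, query, and model name are placeholders.

```python
from rank_bm25 import BM25Okapi                  # pip install rank-bm25
from sentence_transformers import CrossEncoder   # pip install sentence-transformers

# Toy corpus standing in for chunked scikit-learn documentation.
documents = [
    "RandomForestClassifier: a meta estimator that fits decision trees.",
    "fit(X, y) trains the estimator on feature matrix X and targets y.",
    "Cross-validation evaluates an estimator on held-out splits.",
]

query = "how do I train a random forest"

# Stage 1: BM25 lexical retrieval (CPU-only, very fast).
bm25 = BM25Okapi([d.lower().split() for d in documents])
scores = bm25.get_scores(query.lower().split())
top = sorted(range(len(documents)), key=scores.__getitem__, reverse=True)[:2]
candidates = [documents[i] for i in top]

# Stage 2: re-rank only the few candidates with a transformer
# cross-encoder (the model name is an assumption; any re-ranker works).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pair_scores = reranker.predict([(query, d) for d in candidates])
reranked = [d for _, d in sorted(zip(pair_scores, candidates),
                                 key=lambda t: t[0], reverse=True)]

# Stage 3: the top-ranked chunks become the LLM's context.
context = "\n\n".join(reranked)
```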
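
For zero-shot prompting, one way to make hallucinations easier to catch is to instruct the model to answer only from the retrieved chunks. The exact wording below is an assumption, not the prompt used in the talk.

```python
def build_prompt(query: str, context: str) -> str:
    """Zero-shot prompt (no worked examples) that pins the answer
    to the retrieved chunks, so a missing answer is admitted rather
    than hallucinated."""
    return (
        "Answer the question using only the scikit-learn documentation "
        "excerpts below. If the excerpts do not contain the answer, say "
        "you do not know instead of guessing.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```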
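
Finally, NumPyDoc structure: the numpydoc package parses a NumPyDoc-formatted docstring into named sections, which maps directly onto the parameter-level chunking mentioned above. Using RandomForestClassifier purely as an example:

```python
from numpydoc.docscrape import NumpyDocString   # pip install numpydoc
from sklearn.ensemble import RandomForestClassifier

# NumPyDoc-formatted docstrings have named sections (Parameters,
# Returns, Examples, ...), which double as natural chunk boundaries.
doc = NumpyDocString(RandomForestClassifier.__doc__)

# Parameter-level chunks: one entry per documented parameter.
param_chunks = [
    f"RandomForestClassifier.{p.name} ({p.type}): " + " ".join(p.desc)
    for p in doc["Parameters"]
]
print(param_chunks[0])  # e.g. the n_estimators parameter
```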