A Retrieval Augmented Generation system to query the scikit-learn documentation

Learn how to build a documentation search system using RAG, sentence-BERT, and Mistral LLM to improve scikit-learn's docs searchability while keeping costs low.

Key takeaways
  • A RAG (Retrieval-Augmented Generation) system was built to improve the searchability of the scikit-learn documentation by combining text retrieval with LLM-generated responses

  • Different documentation types (API docs, user guides, tutorials) require different chunking strategies to effectively break down content for retrieval

  • The system uses sentence-BERT for encoding text and BM25 for retrieval, with a BERT-based re-ranker to improve result quality

  • Mistral was chosen as the LLM because it can run locally with reasonable performance on CPU/M2 hardware

  • A proper chunking strategy is critical: naive chunking often fails and must be adapted to the document's structure and content type

  • Performance considerations: retrieval takes ~3 seconds on CPU; total response time is around 5 seconds including LLM generation

  • Cost was a major factor - cloud API calls would be prohibitively expensive at scikit-learn’s scale of ~1M monthly documentation users

  • The system lacks evaluation metrics because no user-interaction data is collected, for privacy/GDPR reasons

  • The implementation focuses on pure Python code without complex dependencies for better maintainability

  • Current limitations include the inability to process images in the documentation and potential LLM hallucinations when the retrieved context is ambiguous
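To make the retrieval step above concrete, here is a minimal, self-contained sketch of BM25 lexical scoring over pre-chunked documentation snippets. This is an illustration of the technique, not the project's actual code: the tokenizer, parameters (`k1`, `b`), and example snippets are all assumptions, and the real system additionally uses sentence-BERT embeddings and a BERT-based re-ranker on top of this first-stage retrieval.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc chunk against the query with Okapi BM25.

    Whitespace tokenization is a simplification; a real system
    would use a proper tokenizer and stemming/normalization.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many chunks does each term appear?
    df = Counter()
    for doc in tokenized:
        for term in set(doc):
            df[term] += 1
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical documentation chunks, just for illustration.
chunks = [
    "RandomForestClassifier fits a number of decision tree classifiers on sub-samples",
    "StandardScaler standardizes features by removing the mean and scaling to unit variance",
    "GridSearchCV performs an exhaustive search over specified parameter values",
]
scores = bm25_scores("decision tree classifiers", chunks)
best = max(range(len(chunks)), key=scores.__getitem__)
print(chunks[best])  # the RandomForestClassifier chunk ranks first
```

In the full pipeline, the top-k chunks returned by this stage would be re-scored by a cross-encoder re-ranker before being passed as context to the LLM.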