Build an AI Document Inquiry Chat with Offline LLMs [PyCon DE & PyData Berlin 2024]

Learn how to build an offline AI chat system using local LLMs, RAG, and document processing. Covers quantization, tokenization, chunking, and practical implementation with Ragna.

Key takeaways
  • RAG (Retrieval Augmented Generation) lets an LLM answer from your local documents, using retrieved passages as context for more accurate and verifiable answers
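
A minimal sketch of the retrieve-then-generate loop behind RAG. The word-overlap retriever and example chunks below are illustrative stand-ins for a real embedding-based retriever; only the overall flow reflects the talk.

```python
# RAG in miniature: rank stored chunks against the question, then prepend the
# best ones to the prompt so the LLM answers from your documents.
def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Naive ranking by shared words; real systems rank by embedding similarity."""
    q_words = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(question: str, sources: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return (
        "Answer the question using only the numbered sources below and cite them.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

chunks = [
    "Ragna is an open-source framework for orchestrating RAG pipelines.",
    "Quantization stores LLM weights in fewer bits to save memory.",
    "PyCon DE & PyData Berlin took place in 2024.",
]
question = "What is Ragna?"
prompt = build_prompt(question, retrieve(question, chunks))
print(prompt)  # this prompt is what gets sent to the (local) LLM
```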

  • Quantization is crucial for running LLMs locally: converting float32 weights to lower precision (4-8 bits) shrinks the model so it fits on a consumer GPU, with minimal impact on output quality
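
The talk does not tie quantization to one library; as one common route, Hugging Face transformers plus bitsandbytes can load a model with 8-bit (or 4-bit) weights. The model name below is only an example.

```python
# Load an LLM with quantized weights via transformers + bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example; any causal LM on the Hub

# 8-bit weights; swap for load_in_4bit=True to shrink the footprint further.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place the quantized layers on the available GPU(s)
)
```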

  • Tokenization is model-specific: each LLM has its own vocabulary for converting text to token IDs, so tokenizers are not interchangeable between models
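
A quick way to see this, assuming the Hugging Face tokenizers are available: encode the same string with two different models' tokenizers and compare the IDs.

```python
# The same text maps to different token IDs (and even a different token count)
# depending on the model's vocabulary.
from transformers import AutoTokenizer

text = "Retrieval Augmented Generation"
for model_id in ("gpt2", "bert-base-uncased"):
    tok = AutoTokenizer.from_pretrained(model_id)
    ids = tok.encode(text, add_special_tokens=False)
    print(model_id, len(ids), ids)
```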

  • Chunk size and overlap are critical for document processing:

    • Default is 500 tokens per chunk
    • 30-50% chunk overlap recommended for best results
    • Poor chunking can split sentences and break context, making it harder for the LLM to understand the content (see the sliding-window sketch below)
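
A sliding-window chunker along those lines might look like this; the 500-token size and 40% overlap follow the talk's guidance, while the GPT-2 tokenizer is just a stand-in for whichever tokenizer matches your model.

```python
# Token-based chunking with overlap so context is not cut off at chunk borders.
from transformers import AutoTokenizer

def chunk_tokens(text: str, tokenizer, chunk_size: int = 500, overlap: float = 0.4) -> list[str]:
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = max(1, int(chunk_size * (1 - overlap)))  # 40% overlap -> advance 300 tokens
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(tokenizer.decode(ids[start:start + chunk_size]))
        if start + chunk_size >= len(ids):
            break
    return chunks

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(len(chunk_tokens("some long document text " * 500, tokenizer)))
```
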
  • Ragna provides:

    • Support for multiple LLMs and storage backends
    • Async streaming capabilities
    • Source verification (showing which document chunks were used)
    • REST API, Python API, and web UI interfaces (a Python API sketch follows this list)
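
A minimal chat with the Python API looks roughly like the sketch below. The class names are taken from Ragna's quickstart as I recall it, and the demo components only echo sources back, so treat this as a shape to check against the current Ragna docs, swapping in a real source storage (e.g. Chroma or LanceDB) and a real assistant.

```python
# Sketch of a Ragna chat using the built-in demo components.
import asyncio
from ragna import Rag
from ragna.assistants import RagnaDemoAssistant
from ragna.source_storages import RagnaDemoSourceStorage

async def main() -> None:
    async with Rag().chat(
        documents=["handbook.pdf"],  # local files the answers should be grounded in
        source_storage=RagnaDemoSourceStorage,
        assistant=RagnaDemoAssistant,
    ) as chat:
        message = await chat.answer("What does the handbook say about onboarding?")
        print(message)  # the message also records which document chunks were used

asyncio.run(main())
```
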
  • Local LLM considerations:

    • Roughly 4 GB of VRAM per billion parameters at full float32 precision, i.e. 4 bytes per weight (see the rule-of-thumb calculation after this list)
    • 8-bit quantization offers a good balance of output quality versus memory use
    • Instruction-tuned models are preferred because they follow task instructions more reliably
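
Those sizing rules are plain arithmetic: each weight costs as many bytes as its precision dictates. A back-of-the-envelope helper, with the ~20% overhead factor being my own assumption and activation/KV-cache memory ignored:

```python
def rough_vram_gb(n_params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Rule of thumb: parameters * bytes per weight, plus ~20% overhead."""
    weight_bytes = n_params_billion * 1e9 * (bits_per_weight / 8)
    return weight_bytes * overhead / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits}-bit: ~{rough_vram_gb(7, bits):.1f} GB")
# 32-bit: ~33.6 GB, 16-bit: ~16.8 GB, 8-bit: ~8.4 GB, 4-bit: ~4.2 GB
```
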
  • Vector databases are commonly used but not required - any storage system implementing the source storage interface can work (a minimal interface sketch follows)
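
The interface idea is simple: anything that can store chunks and hand back relevant ones for a prompt can sit behind the pipeline. The sketch below is a hypothetical illustration, not Ragna's actual base class; consult its docs for the real SourceStorage methods.

```python
# Hypothetical storage interface: any backend that can store chunks and return
# relevant ones for a prompt could back a RAG pipeline, vector database or not.
from typing import Protocol

class SourceStorageLike(Protocol):
    def store(self, chunks: list[str]) -> None: ...
    def retrieve(self, prompt: str, num_results: int = 3) -> list[str]: ...

class InMemoryStorage:
    """Toy backend that satisfies the protocol without any vector database."""
    def __init__(self) -> None:
        self._chunks: list[str] = []

    def store(self, chunks: list[str]) -> None:
        self._chunks.extend(chunks)

    def retrieve(self, prompt: str, num_results: int = 3) -> list[str]:
        # A real backend would rank by similarity to the prompt (e.g. embeddings);
        # this toy version just returns the first stored chunks.
        return self._chunks[:num_results]
```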

  • Temperature setting of 0.0 recommended for RAG to ensure deterministic responses

  • Document extraction quality significantly impacts RAG performance - PDF parsing remains challenging
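
As an example of where extraction quality enters the picture, even a straightforward text pull from a PDF (here with pypdf, one library among several) tends to lose tables, reading order, and hyphenation, and that noise flows straight into the chunks.

```python
# Basic PDF text extraction with pypdf; the raw text usually needs cleanup
# (broken hyphenation, lost tables, repeated headers/footers) before chunking.
from pypdf import PdfReader

reader = PdfReader("handbook.pdf")  # example path
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])
```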

  • Fine-tuning vs RAG choice depends on use case:

    • RAG better for dynamic knowledge bases
    • Fine-tuning better for specific domain adaptation