Build an AI Document Inquiry Chat with Offline LLMs [PyCon DE & PyData Berlin 2024]

Learn how to build an offline AI chat system using local LLMs, RAG, and document processing. Covers quantization, tokenization, chunking, and practical implementation with Ragna.

Key takeaways
  • RAG (Retrieval Augmented Generation) lets an LLM answer from your local documents, using retrieved passages as context for more accurate and verifiable answers
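
A minimal sketch of the retrieve-then-generate loop behind RAG. The word-overlap retriever and example chunks below are illustrative stand-ins for a real embedding-based retriever; only the overall flow reflects the talk.

```python
# RAG in miniature: rank stored chunks against the question, then prepend the
# best ones to the prompt so the LLM answers from your documents.
def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Naive ranking by shared words; real systems rank by embedding similarity."""
    q_words = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(question: str, sources: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return (
        "Answer the question using only the numbered sources below and cite them.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

chunks = [
    "Ragna is an open-source framework for orchestrating RAG pipelines.",
    "Quantization stores LLM weights in fewer bits to save memory.",
    "PyCon DE & PyData Berlin took place in 2024.",
]
question = "What is Ragna?"
prompt = build_prompt(question, retrieve(question, chunks))
print(prompt)  # this prompt is what gets sent to the (local) LLM
```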

  • Quantization is crucial for running LLMs locally: converting float32 weights to lower precision (4-8 bits) shrinks the model so it fits on a consumer GPU, with minimal impact on output quality
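
The talk does not tie quantization to one library; as one common route, Hugging Face transformers plus bitsandbytes can load a model with 8-bit (or 4-bit) weights. The model name below is only an example.

```python
# Load an LLM with quantized weights via transformers + bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example; any causal LM on the Hub

# 8-bit weights; swap for load_in_4bit=True to shrink the footprint further.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place the quantized layers on the available GPU(s)
)
```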

  • Tokenization is model-specific: each LLM has its own vocabulary for converting text to token IDs, so tokenizers are not interchangeable between models
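
A quick way to see this, assuming the Hugging Face tokenizers are available: encode the same string with two different models' tokenizers and compare the IDs.

```python
# The same text maps to different token IDs (and even a different token count)
# depending on the model's vocabulary.
from transformers import AutoTokenizer

text = "Retrieval Augmented Generation"
for model_id in ("gpt2", "bert-base-uncased"):
    tok = AutoTokenizer.from_pretrained(model_id)
    ids = tok.encode(text, add_special_tokens=False)
    print(model_id, len(ids), ids)
```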

  • Chunk size and overlap are critical for document processing:

    • Default is 500 tokens per chunk
    • 30-50% chunk overlap recommended for best results
    • Poor chunking can split sentences and break context, making it harder for the LLM to understand the content (see the sliding-window sketch below)
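
A sliding-window chunker along those lines might look like this; the 500-token size and 40% overlap follow the talk's guidance, while the GPT-2 tokenizer is just a stand-in for whichever tokenizer matches your model.

```python
# Token-based chunking with overlap so context is not cut off at chunk borders.
from transformers import AutoTokenizer

def chunk_tokens(text: str, tokenizer, chunk_size: int = 500, overlap: float = 0.4) -> list[str]:
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = max(1, int(chunk_size * (1 - overlap)))  # 40% overlap -> advance 300 tokens
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(tokenizer.decode(ids[start:start + chunk_size]))
        if start + chunk_size >= len(ids):
            break
    return chunks

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(len(chunk_tokens("some long document text " * 500, tokenizer)))
```
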
  • Ragna provides:

    • Support for multiple LLMs and storage backends
    • Async streaming capabilities
    • Source verification (showing which document chunks were used)
    • REST API, Python API, and web UI interfaces (a Python API sketch follows this list)
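
A minimal chat with the Python API looks roughly like the sketch below. The class names are taken from Ragna's quickstart as I recall it, and the demo components only echo sources back, so treat this as a shape to check against the current Ragna docs, swapping in a real source storage (e.g. Chroma or LanceDB) and a real assistant.

```python
# Sketch of a Ragna chat using the built-in demo components.
import asyncio
from ragna import Rag
from ragna.assistants import RagnaDemoAssistant
from ragna.source_storages import RagnaDemoSourceStorage

async def main() -> None:
    async with Rag().chat(
        documents=["handbook.pdf"],  # local files the answers should be grounded in
        source_storage=RagnaDemoSourceStorage,
        assistant=RagnaDemoAssistant,
    ) as chat:
        message = await chat.answer("What does the handbook say about onboarding?")
        print(message)  # the message also records which document chunks were used

asyncio.run(main())
```
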
  • Local LLM considerations:

    • Roughly 4 GB of VRAM per billion parameters at full float32 precision, i.e. 4 bytes per weight (see the rule-of-thumb calculation after this list)
    • 8-bit quantization offers a good balance of output quality versus memory use
    • Instruction-tuned models are preferred because they follow task instructions more reliably
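
Those sizing rules are plain arithmetic: each weight costs as many bytes as its precision dictates. A back-of-the-envelope helper, with the ~20% overhead factor being my own assumption and activation/KV-cache memory ignored:

```python
def rough_vram_gb(n_params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Rule of thumb: parameters * bytes per weight, plus ~20% overhead."""
    weight_bytes = n_params_billion * 1e9 * (bits_per_weight / 8)
    return weight_bytes * overhead / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits}-bit: ~{rough_vram_gb(7, bits):.1f} GB")
# 32-bit: ~33.6 GB, 16-bit: ~16.8 GB, 8-bit: ~8.4 GB, 4-bit: ~4.2 GB
```
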
  • Vector databases are commonly used but not required - any storage system implementing the source storage interface can work (a minimal interface sketch follows)
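
The interface idea is simple: anything that can store chunks and hand back relevant ones for a prompt can sit behind the pipeline. The sketch below is a hypothetical illustration, not Ragna's actual base class; consult its docs for the real SourceStorage methods.

```python
# Hypothetical storage interface: any backend that can store chunks and return
# relevant ones for a prompt could back a RAG pipeline, vector database or not.
from typing import Protocol

class SourceStorageLike(Protocol):
    def store(self, chunks: list[str]) -> None: ...
    def retrieve(self, prompt: str, num_results: int = 3) -> list[str]: ...

class InMemoryStorage:
    """Toy backend that satisfies the protocol without any vector database."""
    def __init__(self) -> None:
        self._chunks: list[str] = []

    def store(self, chunks: list[str]) -> None:
        self._chunks.extend(chunks)

    def retrieve(self, prompt: str, num_results: int = 3) -> list[str]:
        # A real backend would rank by similarity to the prompt (e.g. embeddings);
        # this toy version just returns the first stored chunks.
        return self._chunks[:num_results]
```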

  • Temperature setting of 0.0 recommended for RAG to ensure deterministic responses

  • Document extraction quality significantly impacts RAG performance - PDF parsing remains challenging
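
As an example of where extraction quality enters the picture, even a straightforward text pull from a PDF (here with pypdf, one library among several) tends to lose tables, reading order, and hyphenation, and that noise flows straight into the chunks.

```python
# Basic PDF text extraction with pypdf; the raw text usually needs cleanup
# (broken hyphenation, lost tables, repeated headers/footers) before chunking.
from pypdf import PdfReader

reader = PdfReader("handbook.pdf")  # example path
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])
```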

  • Fine-tuning vs RAG choice depends on use case:

    • RAG better for dynamic knowledge bases
    • Fine-tuning better for specific domain adaptation