Build an AI Document Inquiry Chat with Offline LLMs [PyCon DE & PyData Berlin 2024]
Learn how to build an offline AI chat system using local LLMs, RAG, and document processing. Covers quantization, tokenization, chunking, and practical implementation with Ragna.
- RAG (Retrieval Augmented Generation) lets an LLM use local documents: retrieved passages are injected into the prompt as context, yielding more accurate and verifiable answers
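  At its core this just means assembling the prompt from the question plus the retrieved chunks. A minimal, library-agnostic sketch (the `retrieve` and `llm_generate` names in the usage comment are hypothetical placeholders):

  ```python
  def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
      """Assemble an augmented prompt: retrieved document chunks become the context."""
      context = "\n\n".join(retrieved_chunks)
      return (
          "Answer the question using only the sources below. "
          "If the sources do not contain the answer, say so.\n\n"
          f"Sources:\n{context}\n\n"
          f"Question: {question}\nAnswer:"
      )

  # Hypothetical usage: `retrieve` stands in for a source-storage query and
  # `llm_generate` for a call to the (local) LLM.
  # chunks = retrieve(question, top_k=5)
  # answer = llm_generate(build_rag_prompt(question, chunks))
  ```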
- Quantization is crucial for running LLMs locally: converting float32 weights to lower precision (4-8 bits) shrinks the model enough to fit on consumer GPUs, with minimal performance impact
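  As an illustration, 8-bit loading with Hugging Face transformers and bitsandbytes looks roughly like this; the model id is a placeholder and the exact options depend on your hardware and library versions:

  ```python
  import torch
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig

  # Quantize weights to 8-bit at load time instead of keeping them in float32,
  # cutting the memory footprint by roughly 4x.
  quant_config = BitsAndBytesConfig(load_in_8bit=True)

  model = AutoModelForCausalLM.from_pretrained(
      "mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model id
      quantization_config=quant_config,
      device_map="auto",          # spread layers across the available GPU(s)
      torch_dtype=torch.float16,  # keep non-quantized tensors in half precision
  )
  ```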
- Tokenization is model-specific: each LLM has its own vocabulary for converting text to token IDs, so tokenizers are not interchangeable between models
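  A quick way to see this with Hugging Face tokenizers (the model ids are just examples): the same sentence maps to different token IDs, and often a different number of tokens, under each model's vocabulary.

  ```python
  from transformers import AutoTokenizer

  text = "Retrieval augmented generation with offline LLMs"

  for model_id in ["gpt2", "bert-base-uncased"]:
      tokenizer = AutoTokenizer.from_pretrained(model_id)
      ids = tokenizer.encode(text)
      print(model_id, len(ids), ids)
  ```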
- Chunk size and overlap are critical for document processing (see the sketch after this list):
  - The default is 500 tokens per chunk
  - 30-50% chunk overlap is recommended for best results
  - Poor chunking can break context and make it harder for the LLM to understand the content
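  A minimal sliding-window chunker over token IDs, assuming a Hugging Face tokenizer; the defaults below (500 tokens, 40% overlap) follow the guidance above:

  ```python
  from transformers import AutoTokenizer

  def chunk_text(text: str, tokenizer, chunk_size: int = 500, overlap: float = 0.4) -> list[str]:
      """Split text into chunks of `chunk_size` tokens with fractional `overlap`."""
      ids = tokenizer.encode(text, add_special_tokens=False)
      step = max(1, int(chunk_size * (1 - overlap)))
      chunks = []
      for start in range(0, len(ids), step):
          window = ids[start : start + chunk_size]
          chunks.append(tokenizer.decode(window))
          if start + chunk_size >= len(ids):
              break
      return chunks

  # Example usage with a placeholder tokenizer and file:
  # tokenizer = AutoTokenizer.from_pretrained("gpt2")
  # chunks = chunk_text(open("report.txt").read(), tokenizer)
  ```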
- Ragna provides (a minimal usage sketch follows this list):
  - Support for multiple LLMs and storage backends
  - Async streaming capabilities
  - Source verification (showing which document chunks were used)
  - REST API, Python API, and web UI interfaces
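  A sketch of the Python API, loosely following Ragna's documentation; the document path is a placeholder, Chroma requires the chromadb extra, and attribute names on the answer object may differ between Ragna versions:

  ```python
  import asyncio

  from ragna import Rag
  from ragna.assistants import RagnaDemoAssistant  # swap in a real (e.g. local) assistant
  from ragna.source_storages import Chroma         # or any other supported source storage

  async def main() -> None:
      async with Rag().chat(
          documents=["report.pdf"],  # placeholder document path
          source_storage=Chroma,
          assistant=RagnaDemoAssistant,
      ) as chat:
          answer = await chat.answer("What does the report conclude?")
          print(answer)
          # Source verification: inspect which chunks backed the answer
          # (assumes the answer message exposes its sources).
          for source in answer.sources:
              print(source)

  asyncio.run(main())
  ```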
- Local LLM considerations:
  - Roughly 4 GB of VRAM per billion parameters at full float32 precision; quantization reduces this proportionally (see the estimate after this list)
  - 8-bit quantization provides a good balance of performance vs. memory
  - Instruction-tuned models are preferred for better task completion
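  The 4 GB-per-billion figure follows directly from 4 bytes per float32 weight; a back-of-the-envelope estimate for the weights alone (activations and KV cache add more on top):

  ```python
  def weight_memory_gb(billions_of_params: float, bits_per_weight: int) -> float:
      """Approximate VRAM needed just to hold the model weights."""
      return billions_of_params * 1e9 * bits_per_weight / 8 / 1e9

  for bits in (32, 16, 8, 4):
      print(f"7B model at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
  # float32: ~28 GB, float16: ~14 GB, 8-bit: ~7 GB, 4-bit: ~3.5 GB
  ```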
- Vector databases are commonly used but not required: any storage system that implements the source storage interface can work
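  As an illustration of that idea (this is not Ragna's actual interface, whose method signatures vary by version), a hypothetical in-memory store that retrieves chunks by naive keyword overlap instead of vector similarity:

  ```python
  class InMemorySourceStorage:
      """Hypothetical source storage: keeps chunks in a list, no vector database."""

      def __init__(self) -> None:
          self._chunks: list[str] = []

      def store(self, chunks: list[str]) -> None:
          self._chunks.extend(chunks)

      def retrieve(self, prompt: str, top_k: int = 5) -> list[str]:
          # Rank chunks by how many words they share with the prompt.
          prompt_words = set(prompt.lower().split())
          ranked = sorted(
              self._chunks,
              key=lambda chunk: len(prompt_words & set(chunk.lower().split())),
              reverse=True,
          )
          return ranked[:top_k]
  ```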
- A temperature setting of 0.0 is recommended for RAG to ensure deterministic responses
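  With API-backed models this is usually a `temperature=0.0` argument; with a local Hugging Face model the equivalent is greedy decoding (`do_sample=False`). A small sketch using a placeholder model id:

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "gpt2"  # placeholder; use your local instruction-tuned model
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id)

  inputs = tokenizer("Answer using only the provided sources: ...", return_tensors="pt")
  # do_sample=False means greedy decoding: deterministic, temperature-0 behaviour.
  outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```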
- Document extraction quality significantly impacts RAG performance; PDF parsing in particular remains challenging
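  For example, plain-text extraction with pypdf is a common first step, and its quality varies a lot between PDFs (scans, multi-column layouts, and tables are typical trouble spots); the file path is a placeholder:

  ```python
  from pypdf import PdfReader

  reader = PdfReader("report.pdf")  # placeholder path
  text = "\n".join(page.extract_text() or "" for page in reader.pages)
  print(text[:500])
  ```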
- The choice between fine-tuning and RAG depends on the use case:
  - RAG is better for dynamic knowledge bases
  - Fine-tuning is better for specific domain adaptation