Shashank Shekhar - LLMs: Beyond the Hype - A Practical Journey to Scale | PyData Global 2023
Learn practical strategies for scaling LLM applications, from architecture and cost optimization to RAG implementation, monitoring, security and performance tuning.
-
An LLM application architecture typically consists of six key components: model selection, a prompt template, a vector database, LLM agents/tools, an orchestrator, and a monitoring module
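A minimal sketch of how these components might be wired together, omitting agents/tools for brevity. All class names (StubModel, Orchestrator, etc.) and the keyword-overlap "search" are illustrative stand-ins, not any real framework's API.

```python
class StubModel:
    """Stands in for a hosted or local LLM (model selection)."""
    def generate(self, prompt: str) -> str:
        return f"answer to: {prompt}"

class PromptTemplate:
    """Prompt template component: renders context + question into a prompt."""
    def __init__(self, template: str):
        self.template = template
    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)

class VectorStore:
    """Toy stand-in for a vector database (naive word overlap, no embeddings)."""
    def __init__(self):
        self.docs: list[str] = []
    def add(self, doc: str) -> None:
        self.docs.append(doc)
    def search(self, query: str) -> str:
        return max(self.docs,
                   key=lambda d: len(set(d.split()) & set(query.split())),
                   default="")

class Monitor:
    """Monitoring module: records simple named metrics."""
    def __init__(self):
        self.events: list[tuple[str, float]] = []
    def log(self, name: str, value: float) -> None:
        self.events.append((name, value))

class Orchestrator:
    """Ties model, template, retrieval, and monitoring together."""
    def __init__(self, model, template, store, monitor):
        self.model, self.template = model, template
        self.store, self.monitor = store, monitor
    def answer(self, question: str) -> str:
        context = self.store.search(question)
        prompt = self.template.render(context=context, question=question)
        self.monitor.log("prompt_chars", len(prompt))
        return self.model.generate(prompt)
```

The orchestrator is the only component that knows about all the others, which keeps each piece swappable (e.g. replacing StubModel with a real API client).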
-
Cost optimization strategies include:
- Using smaller models for simple queries
- LLM caching to store and reuse responses
- Quantization for model compression
- On-premise deployment for long-term savings
- Request hedging between expensive and cheaper models
-
RAG (Retrieval-Augmented Generation) improves accuracy by:
- Chunking documents into blocks
- Using vector databases for efficient storage and retrieval
- Retrieving the most relevant chunks from the corpus per query
- Supplying that external knowledge as context to the LLM
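The retrieval steps above can be sketched end to end with a toy bag-of-words "embedding" and cosine similarity; a real system would use a trained embedding model and a vector database instead.

```python
import math
from collections import Counter

def chunk(text: str, size: int = 8) -> list[str]:
    """Split a document into fixed-size word blocks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

The retrieved chunks would then be interpolated into the prompt template as context before calling the model.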
-
Evaluation and monitoring requirements:
- Offline evaluation before exposing the system to real users
- Monitoring latency, accuracy, and throughput
- Collecting user feedback
- Measuring contextual relevance of retrieved and generated content
- Tracking drift in model behavior over time
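A minimal sketch of the latency/throughput side of monitoring: wrap each request, record its duration, and derive aggregate stats. The metric names are illustrative, not from any particular observability tool.

```python
import time

class Monitor:
    def __init__(self):
        self.latencies: list[float] = []
        self.started = time.monotonic()

    def observe(self, fn, *args):
        """Run one request through fn, recording its wall-clock latency."""
        t0 = time.monotonic()
        result = fn(*args)
        self.latencies.append(time.monotonic() - t0)
        return result

    def stats(self) -> dict:
        elapsed = time.monotonic() - self.started
        n = len(self.latencies)
        return {
            "requests": n,
            "p50_latency_s": sorted(self.latencies)[n // 2] if n else None,
            "throughput_rps": n / elapsed if elapsed > 0 else 0.0,
        }
```

Accuracy, feedback, and drift tracking need labeled or human-rated data and are usually layered on top of the same per-request hooks.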
-
Key technical considerations:
- Model licensing for commercial use
- Proper tokenizer and embedding model selection
- GPU/CPU resource optimization
- Vector database selection and setup
- Guardrails for safety and reliability
-
Performance improvement techniques:
- Prompt engineering and optimization
- Fine-tuning for specific tasks
- Model quantization for faster inference
- Caching frequent queries
- Using specialized models for different tasks
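The "specialized models for different tasks" idea can be sketched as a router that sends short, simple queries to a cheap model and everything else to a larger one. Both models and the length heuristic are stand-ins; production routers often use a trained classifier instead.

```python
def small_model(prompt: str) -> str:
    """Stand-in for a cheap, fast model."""
    return f"small: {prompt}"

def large_model(prompt: str) -> str:
    """Stand-in for an expensive, capable model."""
    return f"large: {prompt}"

def route(prompt: str) -> str:
    # Heuristic: short prompts without code blocks count as "simple".
    simple = len(prompt.split()) < 10 and "```" not in prompt
    return small_model(prompt) if simple else large_model(prompt)
```

The same dispatch point is where request hedging fits: issue the cheap call first and fall back to (or race) the expensive model when the cheap answer is inadequate.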
-
Security and responsible AI aspects:
- Content filtering
- Input validation
- Handling sensitive data
- Managing hallucinations
- Bias detection
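Two of the guardrails above, input validation and sensitive-data handling, can be sketched as a blocklist check plus redaction of email-like strings before the prompt reaches the model. The patterns here are illustrative, not an exhaustive filter.

```python
import re

BLOCKED_PHRASES = ("ignore previous instructions", "system prompt")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def validate(user_input: str) -> bool:
    """Reject inputs containing known prompt-injection phrases."""
    lowered = user_input.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)

def redact(user_input: str) -> str:
    """Mask email addresses before the text is sent to the model or logged."""
    return EMAIL.sub("[REDACTED_EMAIL]", user_input)
```

Hallucination management and bias detection operate on the output side instead, typically by checking generated claims against the retrieved context.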