Shashank Shekhar - LLMs: Beyond the Hype - A Practical Journey to Scale | PyData Global 2023
Learn practical strategies for scaling LLM applications, from architecture and cost optimization to RAG implementation, monitoring, security and performance tuning.
- LLM application architecture typically consists of 6 key components: model selection, prompt template, vector database, LLM agents/tools, orchestrator, and monitoring module.
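The six components above can be wired together in a minimal sketch. This is an illustrative skeleton, not any particular framework's API; the class and field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class LLMApp:
    model: str                                       # model selection
    prompt_template: str                             # prompt template
    vector_db: dict = field(default_factory=dict)    # vector database (stub)
    tools: list = field(default_factory=list)        # LLM agents/tools
    metrics: list = field(default_factory=list)      # monitoring module

    def run(self, query: str) -> str:                # orchestrator
        context = self.vector_db.get(query, "")
        prompt = self.prompt_template.format(context=context, query=query)
        self.metrics.append({"query": query, "prompt_len": len(prompt)})
        return f"[{self.model}] response to: {query}"

app = LLMApp(model="small-llm",
             prompt_template="Context: {context}\nQ: {query}")
print(app.run("What is RAG?"))
```

In a real system each field would hold a live client (model API, vector store, tool registry) rather than a stub, but the orchestrator's job stays the same: gather context, fill the template, call the model, record metrics.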
- Cost optimization strategies include:
    - Using smaller models for simple queries
    - LLM caching to store and reuse responses
    - Quantization for model compression
    - On-premise deployment for long-term savings
    - Request hedging between expensive and cheaper models
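Two of these strategies, caching and cheap-first routing between models, can be sketched together. The model functions below are stand-ins for real API calls, and reading "request hedging" as a confidence-based fallback to the larger model is one interpretation among several.

```python
import hashlib

cache: dict[str, str] = {}

def cheap_model(prompt: str) -> tuple[str, float]:
    # Stand-in for a small model; returns (answer, confidence).
    return f"cheap answer to: {prompt}", 0.6

def expensive_model(prompt: str) -> str:
    # Stand-in for a large, costly model.
    return f"expensive answer to: {prompt}"

def answer(prompt: str, threshold: float = 0.8) -> str:
    # LLM caching: key on a normalized prompt so trivial variants hit the cache.
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key in cache:
        return cache[key]
    text, confidence = cheap_model(prompt)   # try the smaller model first
    if confidence < threshold:               # hedge to the larger model
        text = expensive_model(prompt)
    cache[key] = text
    return text
```

Every cache hit avoids a model call entirely, and every query the cheap model handles confidently avoids the expensive one.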
- RAG (Retrieval Augmented Generation) improves accuracy by:
    - Retrieving relevant documents from a corpus
    - Chunking information into blocks
    - Using vector databases for efficient storage/retrieval
    - Providing external knowledge as context to the LLM
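The chunk-embed-retrieve loop above can be shown end to end in a toy form. Here a bag-of-words vector stands in for a real embedding model and a plain list stands in for the vector database; both substitutions are assumptions made to keep the sketch self-contained.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts instead of a learned dense vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(corpus: str, size: int = 8) -> list[str]:
    # Chunk the corpus into fixed-size blocks of words.
    words = corpus.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    # Rank chunks by similarity to the query; a vector DB does this at scale.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

corpus = ("Vector databases store embeddings. RAG retrieves relevant chunks "
          "and adds them to the prompt as external knowledge.")
context = retrieve("what does RAG retrieve", chunk(corpus))[0]
prompt = f"Context: {context}\nQuestion: what does RAG retrieve?"
```

The final `prompt` is what actually reaches the LLM: the retrieved chunk supplies external knowledge the model was never trained on.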
- Evaluation and monitoring requirements:
    - Offline testing before human interaction
    - Monitoring latency, accuracy, and throughput
    - Collecting user feedback
    - Measuring contextual relevance
    - Tracking potential drift
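A minimal monitoring wrapper can capture the per-request signals listed above. The record fields and the feedback scale here are illustrative assumptions, not a prescribed schema.

```python
import time

class Monitor:
    def __init__(self):
        self.records = []

    def track(self, fn, query):
        # Wrap one request: measure latency, leave a slot for user feedback.
        start = time.perf_counter()
        answer = fn(query)
        self.records.append({"query": query,
                             "latency_s": time.perf_counter() - start,
                             "feedback": None})
        return answer

    def add_feedback(self, score: int):
        # e.g. +1 thumbs up / -1 thumbs down on the last response
        self.records[-1]["feedback"] = score

    def p50_latency(self) -> float:
        # Median latency; a drift alarm could compare this across weeks.
        return sorted(r["latency_s"] for r in self.records)[len(self.records) // 2]

mon = Monitor()
mon.track(lambda q: q.upper(), "hello")
mon.add_feedback(1)
```

Aggregates like `p50_latency` logged over time are exactly what drift tracking compares: a slow shift in latency, feedback scores, or relevance metrics signals that the model or the traffic has changed.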
- Key technical considerations:
    - Model licensing for commercial use
    - Proper tokenizer and embedding model selection
    - GPU/CPU resource optimization
    - Vector database selection and setup
    - Guardrails for safety and reliability
- Performance improvement techniques:
    - Prompt engineering and optimization
    - Fine-tuning for specific tasks
    - Model quantization for faster inference
    - Caching frequent queries
    - Using specialized models for different tasks
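Two of these techniques, caching frequent queries and routing to specialized models per task, compose naturally. The routing keywords and model names below are placeholder assumptions.

```python
from functools import lru_cache

def route(query: str) -> str:
    # Route coding-flavored queries to a specialized model; the keyword
    # heuristic is a toy stand-in for a real classifier.
    if any(w in query.lower() for w in ("code", "function", "bug")):
        return "code-model"
    return "general-model"

@lru_cache(maxsize=1024)  # cache frequent identical queries
def answer(query: str) -> str:
    model = route(query)
    return f"[{model}] answer to: {query}"
```

Repeated queries return from the cache without touching a model at all, and `answer.cache_info()` exposes hit/miss counts for the monitoring side.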
- Security and responsible AI aspects:
    - Content filtering
    - Input validation
    - Handling sensitive data
    - Managing hallucinations
    - Bias detection
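Input validation and sensitive-data handling can be sketched as a pre-model guardrail. The regexes below are deliberately crude toy examples; production systems use dedicated moderation models and much broader PII detection.

```python
import re

# Crude injection heuristic and email pattern; both are illustrative only.
BLOCKLIST = re.compile(r"ignore (all )?previous instructions", re.I)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def validate(user_input: str) -> str:
    # Input validation: reject obvious prompt-injection attempts.
    if BLOCKLIST.search(user_input):
        raise ValueError("potential prompt injection")
    # Sensitive data handling: redact emails before they reach the model.
    return EMAIL.sub("[REDACTED_EMAIL]", user_input)
```

A symmetric check typically runs on the model's output as well (content filtering, hallucination and bias checks) before anything is shown to the user.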