Shashank Shekhar - LLMs: Beyond the Hype - A Practical Journey to Scale | PyData Global 2023

Learn practical strategies for scaling LLM applications, from architecture and cost optimization to RAG implementation, monitoring, security and performance tuning.

Key takeaways
  • LLM application architecture typically consists of 6 key components: model selection, prompt template, vector database, LLM agents/tools, orchestrator, and monitoring module
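
    The six components above can be wired together as in this minimal sketch. All class and function names here are illustrative stand-ins (the talk does not prescribe an API), and the "model" and "vector database" are toy placeholders:

    ```python
    import time

    def prompt_template(question, context):
        """Prompt template: formats retrieved context plus the user question."""
        return f"Answer using the context below.\nContext: {context}\nQuestion: {question}"

    class VectorDB:
        """Toy vector database: ranks stored documents by keyword overlap.
        A real one stores embeddings and does nearest-neighbor search."""
        def __init__(self, docs):
            self.docs = docs

        def retrieve(self, query):
            q_words = set(query.lower().split())
            return max(self.docs, key=lambda d: len(set(d.lower().split()) & q_words))

    def model(prompt):
        """Stand-in for the selected LLM endpoint."""
        return f"[LLM answer based on: {prompt[:40]}...]"

    class Monitor:
        """Monitoring module: records latency per call."""
        def __init__(self):
            self.latencies = []

        def record(self, seconds):
            self.latencies.append(seconds)

    def orchestrator(question, db, monitor):
        """Orchestrator: chains retrieval, templating, the model, and monitoring."""
        start = time.perf_counter()
        context = db.retrieve(question)
        answer = model(prompt_template(question, context))
        monitor.record(time.perf_counter() - start)
        return answer
    ```

    Agents/tools would slot into the orchestrator as extra steps between retrieval and the final model call.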

  • Cost optimization strategies include:

    • Using smaller models for simple queries
    • LLM caching to store and reuse responses
    • Quantization for model compression
    • On-premises deployment for long-term savings
    • Request hedging between expensive and cheaper models
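
    Two of the strategies above, caching and routing simple queries to a smaller model, can be combined in a few lines. The `cheap_model`/`expensive_model` functions and the length-based routing rule are illustrative assumptions, not from the talk; a real router would classify query complexity rather than measure length:

    ```python
    import hashlib

    # Placeholders for real model endpoints (e.g. a small local model vs. a hosted one).
    def cheap_model(prompt):
        return f"cheap:{prompt}"

    def expensive_model(prompt):
        return f"expensive:{prompt}"

    _cache = {}

    def cached_route(prompt, max_cheap_len=50):
        """Route short prompts to the cheaper model; cache every response for reuse."""
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in _cache:
            return _cache[key]  # cache hit: no model call, no cost
        model = cheap_model if len(prompt) <= max_cheap_len else expensive_model
        response = model(prompt)
        _cache[key] = response
        return response
    ```
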
  • RAG (Retrieval-Augmented Generation) improves accuracy by:

    • Chunking documents into blocks
    • Using vector databases for efficient storage/retrieval
    • Retrieving the chunks most relevant to each query from the corpus
    • Providing the retrieved text as external knowledge context to the LLM
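
    The RAG steps above, in code, assuming a toy bag-of-words "embedding" in place of a real embedding model and in-memory chunks in place of a vector database:

    ```python
    import math
    from collections import Counter

    def chunk(text, size=40):
        """Chunk a document into fixed-size character blocks."""
        return [text[i:i + size] for i in range(0, len(text), size)]

    def embed(text):
        """Toy embedding: word counts. A real system uses a learned embedding model."""
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def retrieve(query, chunks, k=2):
        """Return the top-k chunks most similar to the query."""
        q = embed(query)
        return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

    def rag_prompt(query, chunks):
        """Provide the retrieved chunks as external knowledge context to the LLM."""
        context = "\n".join(retrieve(query, chunks))
        return f"Context:\n{context}\n\nQuestion: {query}"
    ```
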
  • Evaluation and monitoring requirements:

    • Offline testing before human interaction
    • Monitoring latency, accuracy, and throughput
    • Collecting user feedback
    • Measuring contextual relevance
    • Tracking potential drift
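
    A minimal monitoring harness covering several of these requirements (latency, call counts, user feedback, and a crude drift signal). The class and its thresholds are illustrative, not a prescribed design; production systems would use a metrics backend instead of in-memory lists:

    ```python
    import statistics
    import time

    class LLMMonitor:
        """Track latency, feedback scores, and a simple feedback-drift signal."""

        def __init__(self):
            self.latencies = []
            self.feedback = []

        def timed_call(self, fn, *args):
            """Wrap an LLM call and record its latency."""
            start = time.perf_counter()
            result = fn(*args)
            self.latencies.append(time.perf_counter() - start)
            return result

        def record_feedback(self, score):
            """score: e.g. 1 for thumbs-up, 0 for thumbs-down."""
            self.feedback.append(score)

        def summary(self):
            return {
                "p50_latency_s": statistics.median(self.latencies) if self.latencies else None,
                "calls": len(self.latencies),
                "avg_feedback": statistics.mean(self.feedback) if self.feedback else None,
            }

        def drift_alert(self, window=100, threshold=0.1):
            """Flag drift when recent feedback drops well below the historical mean."""
            if len(self.feedback) < 2 * window:
                return False
            recent = statistics.mean(self.feedback[-window:])
            historical = statistics.mean(self.feedback[:-window])
            return historical - recent > threshold
    ```
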
  • Key technical considerations:

    • Model licensing for commercial use
    • Proper tokenizer and embedding model selection
    • GPU/CPU resource optimization
    • Vector database selection and setup
    • Guardrails for safety and reliability
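
    One consideration worth making concrete is tokenizer selection: context budgets are counted in the model's own tokens, and counts differ per tokenizer. The heuristic below (roughly 4 characters per token for English) is an assumption for illustration only; in practice you must count with the chosen model's actual tokenizer:

    ```python
    def rough_token_count(text):
        """Crude heuristic: ~4 characters per token for English text.
        Real systems must use the model's own tokenizer."""
        return max(1, len(text) // 4)

    def fit_context(chunks, budget=512):
        """Keep only as many retrieved chunks as fit the model's context window."""
        kept, used = [], 0
        for c in chunks:
            cost = rough_token_count(c)
            if used + cost > budget:
                break
            kept.append(c)
            used += cost
        return kept
    ```
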
  • Performance improvement techniques:

    • Prompt engineering and optimization
    • Fine-tuning for specific tasks
    • Model quantization for faster inference
    • Caching frequent queries
    • Using specialized models for different tasks
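
    To make the quantization idea concrete, here is symmetric int8 quantization over a flat list of weights, the core of what compression schemes do per tensor (real implementations operate on arrays, often per-channel, and with calibrated scales):

    ```python
    def quantize_int8(weights):
        """Symmetric int8 quantization: map floats to [-127, 127] with one scale."""
        scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
        q = [round(w / scale) for w in weights]
        return q, scale

    def dequantize(q, scale):
        """Recover approximate float weights from int8 values."""
        return [v * scale for v in q]
    ```

    Storing `q` as int8 uses a quarter of float32's memory, which is where the inference speedup and footprint savings come from.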
  • Security and responsible AI aspects:

    • Content filtering
    • Input validation
    • Handling sensitive data
    • Managing hallucinations
    • Bias detection
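
    A toy sketch of input validation and PII handling. The patterns and blocked phrases below are illustrative assumptions; production systems need a dedicated content-moderation/PII service, not a handful of regexes:

    ```python
    import re

    # Illustrative PII patterns only; far from exhaustive.
    PII_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    # Example prompt-injection phrase to block.
    BLOCKED = ("ignore previous instructions",)

    def validate_input(prompt, max_len=4000):
        """Reject oversized prompts and obvious prompt-injection phrases."""
        if len(prompt) > max_len:
            return False, "prompt too long"
        if any(phrase in prompt.lower() for phrase in BLOCKED):
            return False, "blocked phrase"
        return True, "ok"

    def redact_pii(text):
        """Replace detected PII with type tags before sending text to the model."""
        for name, pattern in PII_PATTERNS.items():
            text = pattern.sub(f"[{name.upper()}]", text)
        return text
    ```
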