Dean Pleban - Customizing and Evaluating LLMs, an Ops Perspective | PyData Global 2023
Learn practical approaches to customizing LLMs using RAG, PEFT, and fine-tuning, plus strategies for evaluation and deployment. Insights from an MLOps perspective.
-
Different levels of LLM customization exist, from easy to complex:
- Prompt engineering and custom logic (easy)
- PEFT methods such as LoRA for efficient domain adaptation (medium; see the sketch after this list)
- Full model fine-tuning and retraining (hard)
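A minimal sketch of the medium-effort tier: parameter-efficient fine-tuning with LoRA via Hugging Face's peft library. The base model name and LoRA hyperparameters below are illustrative placeholders, not values from the talk.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_model_name = "gpt2"  # placeholder; any causal LM with PEFT support works
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA adds small low-rank adapter matrices to the attention layers, so only
# a tiny fraction of the weights is trained during domain adaptation.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # rank of the adapter matrices
    lora_alpha=16,    # scaling factor for the adapter updates
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts
```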
-
RAG (Retrieval Augmented Generation) is a powerful customization method that:
- Works well with other customization techniques
- Is simpler to update than model retraining
- Doesn’t require changing the base model
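A minimal RAG sketch, using a plain TF-IDF retriever from scikit-learn and an illustrative prompt template (the documents and the build_prompt helper are hypothetical). Production systems typically use learned embeddings and a vector store, but the structure is the same: retrieve context, inject it into the prompt, leave the base model untouched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm UTC.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def build_prompt(question: str, top_k: int = 1) -> str:
    # Rank documents by similarity to the question and keep the top matches.
    query_vector = vectorizer.transform([question])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    context = "\n".join(documents[i] for i in scores.argsort()[::-1][:top_k])
    # The retrieved context is injected into the prompt; the LLM itself is
    # unchanged, which is why updating RAG is cheaper than retraining.
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How long do I have to return an item?"))
```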
-
Evaluation of LLMs should include:
- Automated metrics such as ROUGE (a sketch follows this list)
- Human evaluation and feedback
- Testing for biases, toxicity, and edge cases
- Domain-specific evaluation criteria
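A minimal sketch of the automated-metrics piece using Hugging Face's evaluate library; the predictions and references are made-up examples. ROUGE measures n-gram overlap with reference text, so it is only a rough proxy and should sit alongside human review and domain-specific checks.

```python
import evaluate

rouge = evaluate.load("rouge")
predictions = ["The model answered the refund question correctly."]
references = ["The model gave a correct answer about the refund policy."]

# ROUGE compares n-gram overlap between generated and reference text.
scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # rouge1, rouge2, rougeL, rougeLsum scores
```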
-
Tools recommended for LLM customization and evaluation:
- MLflow for experiment tracking (a tracking sketch follows this list)
- Label Studio for human evaluation
- Giskard and Deepchecks for automated testing
- Hugging Face’s evaluate and PEFT libraries
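A minimal sketch of tracking a customization experiment with MLflow; the run name, parameters, and metric values are illustrative, not results from the talk.

```python
import mlflow

with mlflow.start_run(run_name="rag-vs-lora-comparison"):
    # Record which customization setup produced these results.
    mlflow.log_param("customization", "RAG + prompt engineering")
    mlflow.log_param("base_model", "gpt2")
    mlflow.log_param("retriever_top_k", 3)

    # Metrics from automated evaluation (e.g. ROUGE) and human review.
    mlflow.log_metric("rougeL", 0.42)
    mlflow.log_metric("human_preference_rate", 0.71)

    # Store the prompt template itself so the run is reproducible.
    mlflow.log_text(
        "Answer using only this context:\n{context}\n\nQuestion: {question}",
        "prompt_template.txt",
    )
```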
-
Key challenges in LLM customization:
- Finding high-quality domain-specific training data
- High computational costs for fine-tuning
- Ongoing maintenance and updates after deployment
- Balancing technical requirements with domain expertise
-
Customization strategy recommendations:
- Start with simpler methods (RAG, prompt engineering) before complex ones
- Consider using multiple LLMs for different use cases (a routing sketch follows this list)
- Combine multiple customization approaches when needed
- Validate with early users and collect feedback
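A minimal sketch of routing different use cases to different models, as suggested above; the functions and routing table are hypothetical stand-ins for real model clients.

```python
from typing import Callable, Dict

def summarize_with_small_model(text: str) -> str:
    # Placeholder for a cheap, fast model that handles routine summarization.
    return f"[small-model summary of {len(text)} characters]"

def answer_with_large_model(question: str) -> str:
    # Placeholder for a larger, more capable (and more expensive) model.
    return f"[large-model answer to: {question}]"

ROUTES: Dict[str, Callable[[str], str]] = {
    "summarize": summarize_with_small_model,
    "qa": answer_with_large_model,
}

def handle(task: str, payload: str) -> str:
    # Route each use case to the model that fits its cost/quality trade-off.
    return ROUTES[task](payload)

print(handle("summarize", "A long support ticket about a delayed refund..."))
print(handle("qa", "What is the refund window?"))
```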
-
Important considerations for production deployment:
- Establish clear evaluation metrics
- Create comprehensive test suites (a regression-test sketch follows this list)
- Monitor performance in production
- Collect and incorporate user feedback
- Ensure regular updates and maintenance
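A minimal sketch of a regression-style test suite for a deployed model, tying together the test-suite and monitoring points above. The generate function is a hypothetical stand-in for a real model client, and the prompts and expected substrings are invented examples.

```python
def generate(prompt: str) -> str:
    # Placeholder: call your deployed model or inference API here.
    return "Returns are accepted within 30 days of purchase."

# Each case pairs a prompt with a substring the answer is expected to contain.
TEST_CASES = [
    ("What is the refund window?", "30 days"),
    ("How long do customers have to return an item?", "30 days"),
]

def run_regression_checks() -> None:
    for prompt, expected in TEST_CASES:
        answer = generate(prompt)
        assert expected in answer, f"Unexpected answer for {prompt!r}: {answer!r}"

if __name__ == "__main__":
    run_regression_checks()
    print("All regression checks passed.")
```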