Dean Pleban - Customizing and Evaluating LLMs, an Ops Perspective | PyData Global 2023

Learn practical approaches to customizing LLMs using RAG, PEFT, and fine-tuning, plus strategies for evaluation and deployment. Insights from an MLOps perspective.

Key takeaways
  • Different levels of LLM customization exist, from easy to complex:

    • Prompt engineering and custom logic (easy)
    • PEFT and LoRA for efficient domain adaptation (medium; sketched below)
    • Full model fine-tuning and retraining (hard)
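
As a rough illustration of the middle tier, here is a minimal LoRA setup using Hugging Face's peft library. The base model ("gpt2"), rank, and target modules are illustrative assumptions, not choices from the talk.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative base model; swap in whichever model you are adapting.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling applied to the LoRA update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection in GPT-2 (model-specific)
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
# Train with the standard transformers Trainer on domain-specific data.
```
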
  • RAG (Retrieval Augmented Generation) is a powerful customization method (see the sketch after this list) that:

    • Works well with other customization techniques
    • Is simpler to update than model retraining
    • Doesn’t require changing the base model
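
The sketch below shows the basic RAG loop under a few assumptions of mine: documents are embedded with sentence-transformers, the closest ones are retrieved by cosine similarity, and the result is prepended to the prompt. The call_llm() helper is a placeholder, not an API from the talk.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am-5pm UTC.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(question, k=1):
    """Return the k documents most similar to the question (cosine similarity)."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

question = "How long do customers have to return a product?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = call_llm(prompt)  # call_llm stands in for whichever base LLM you use
```
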
  • Evaluation of LLMs should include:

    • Automated metrics (ROUGE, etc.; see the example below)
    • Human evaluation and feedback
    • Testing for biases, toxicity, and edge cases
    • Domain-specific evaluation criteria
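
For the automated-metrics piece, a ROUGE check with Hugging Face's evaluate library can look like this; the predictions and references are made-up examples.

```python
import evaluate

rouge = evaluate.load("rouge")

predictions = ["The model was fine-tuned with LoRA on support tickets."]
references = ["The model was adapted to support tickets using LoRA fine-tuning."]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```
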
  • Tools recommended for LLM customization and evaluation:

    • MLflow for experiment tracking (logging example below)
    • Label Studio for human evaluation
    • Giskard and Deepchecks for automated testing
    • Hugging Face’s evaluate and PEFT libraries
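
A minimal sketch of tracking a customization experiment with MLflow so prompt and LoRA variants stay comparable; the experiment name, parameters, and metric values are placeholders.

```python
import mlflow

mlflow.set_experiment("llm-customization")  # placeholder experiment name

with mlflow.start_run(run_name="lora-r8-support-data"):
    mlflow.log_params({"method": "lora", "rank": 8, "base_model": "gpt2"})
    mlflow.log_metrics({"rougeL": 0.42, "human_eval_score": 3.8})
    # mlflow.log_artifact("eval_samples.jsonl")  # attach sample outputs for review
```
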
  • Key challenges in LLM customization:

    • Finding high-quality domain-specific training data
    • High computational costs for fine-tuning
    • Keeping customized models updated and maintained over time
    • Balancing technical requirements with domain expertise
  • Customization strategy recommendations:

    • Start with simpler methods (RAG, prompt engineering) before complex ones
    • Consider using multiple LLMs for different use cases
    • Combine multiple customization approaches when needed
    • Validate with early users and collect feedback
  • Important considerations for production deployment:

    • Establish clear evaluation metrics
    • Create comprehensive test suites
    • Monitor performance in production
    • Collect and incorporate user feedback
    • Ensure regular updates and maintenance