BJ Hargrave - Open Source Community Instruction-tuning of Large Language Models | PyData Vermont 2024

Learn how InstructLab enables open-source community instruction tuning of LLMs using teacher-student models, synthetic data generation, and a structured taxonomy approach.

Key takeaways
  • InstructLab is a new open-source project by IBM and Red Hat that enables community-driven instruction tuning of Large Language Models (LLMs)

  • The project uses a “teacher-student” model approach:

    • Large teacher models generate synthetic training data
    • Smaller student models learn from this data
    • This reduces training and inference costs while preserving much of the teacher's capability
  • Contributions are organized in a taxonomy with two main types:

    • Knowledge recipes (facts and information)
    • Compositional skill recipes (how to perform tasks)
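A contribution is a YAML recipe file in the taxonomy tree. A minimal sketch of a compositional-skill recipe, loosely following InstructLab's `qna.yaml` convention (the field names reflect that convention; the example content itself is hypothetical):

```yaml
# Hypothetical compositional-skill recipe (qna.yaml-style sketch)
version: 2
task_description: Summarize a short paragraph in one sentence.
created_by: example-contributor
seed_examples:
  - question: |
      Summarize: InstructLab lets communities contribute skills and
      knowledge to an LLM through a structured taxonomy.
    answer: |
      InstructLab enables community-driven LLM tuning via a taxonomy.
```

Seed examples like these are what the teacher model expands into a much larger set of synthetic training pairs.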
  • The synthetic data generation process includes:

    • Question generation by teacher model
    • Answer generation based on provided knowledge
    • Quality evaluation of outputs
    • Fine-tuning the student model with validated data
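The four steps above can be sketched as a small pipeline. This is a simplified, runnable illustration of the control flow only: the real InstructLab pipeline prompts a large teacher LLM at each step, which is stubbed out here, and the function and class names are illustrative, not InstructLab's API.

```python
"""Sketch of a teacher-student synthetic data generation loop.

The teacher is a stub so the control flow runs on its own; in the
real pipeline each step is a prompt to a large teacher model.
"""

def generate_questions(teacher, seeds):
    # Step 1: the teacher expands seed topics into new questions.
    return [teacher.make_question(s) for s in seeds]

def generate_answers(teacher, questions, knowledge):
    # Step 2: the teacher answers each question using the provided knowledge.
    return [(q, teacher.make_answer(q, knowledge)) for q in questions]

def quality_filter(pairs, min_len=10):
    # Step 3: discard low-quality outputs (the real pipeline scores
    # candidates with the teacher model; length is a stand-in here).
    return [(q, a) for q, a in pairs if len(a.strip()) >= min_len]

class StubTeacher:
    """Stand-in for a large teacher LLM."""
    def make_question(self, seed):
        return f"What does the document say about {seed}?"
    def make_answer(self, question, knowledge):
        topic = question.split()[-1].rstrip("?")
        return knowledge.get(topic, "")

knowledge = {"quantization": "Quantization stores weights in fewer bits to cut memory use."}
teacher = StubTeacher()
questions = generate_questions(teacher, ["quantization", "unknown-topic"])
pairs = quality_filter(generate_answers(teacher, questions, knowledge))
# Step 4 would fine-tune the student model on `pairs`; only the
# well-answered pair survives the quality filter.
print(len(pairs))  # → 1
```

The key design point is that only validated pairs ever reach the student, so the student's quality is bounded by the filter rather than by every raw teacher output.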
  • Key advantages of the approach:

    • Allows community contributions similar to open source software
    • Addresses the problem of limited new training data
    • Enables training on private/enterprise data
    • More cost-effective than relying on large proprietary models like ChatGPT
  • Technical implementation details:

    • Runs on Apple Silicon Macs
    • Uses 4-bit quantized models
    • Supports GGUF format
    • Available via pip install
    • Compatible with models like Mistral and Llama
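The memory arithmetic explains why 4-bit quantization makes laptop inference practical. A back-of-the-envelope sketch, assuming a 7B-parameter model and ignoring activation and KV-cache overhead:

```python
# Rough weight-memory estimate for a 7B-parameter model (overhead ignored).
params = 7_000_000_000

fp16_gb = params * 16 / 8 / 1e9   # 16 bits per weight
q4_gb = params * 4 / 8 / 1e9      # 4 bits per weight, as in GGUF Q4 formats

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
# → fp16: 14.0 GB, 4-bit: 3.5 GB
# The fp16 weights alone strain a typical laptop; the 4-bit version
# fits comfortably in an Apple Silicon Mac's unified memory.
```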
  • Project manages training quality through:

    • Replay buffers to prevent forgetting
    • Even distribution of training data across the taxonomy
    • Evaluation of synthetic data quality
    • Balance between different skills and knowledge areas
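The replay-buffer idea can be sketched as mixing a sample of earlier training data into each new training set, so previously learned skills keep appearing and are not overwritten. A minimal sketch; the ratio and sampling scheme are illustrative, not InstructLab's actual values:

```python
import random

def build_training_mix(new_data, replay_buffer, replay_ratio=0.3, seed=0):
    """Combine new examples with a replayed sample of old ones.

    Replaying earlier data alongside new contributions helps prevent
    the student model from forgetting previously learned material.
    """
    rng = random.Random(seed)
    n_replay = int(len(new_data) * replay_ratio)
    replayed = rng.sample(replay_buffer, min(n_replay, len(replay_buffer)))
    mix = new_data + replayed
    rng.shuffle(mix)  # interleave old and new examples
    return mix

old = [f"old-{i}" for i in range(100)]   # previously trained examples
new = [f"new-{i}" for i in range(10)]    # fresh community contributions
batch = build_training_mix(new, old)
print(len(batch))  # → 13: 10 new examples plus 3 replayed old ones
```

The same mechanism also serves the balancing goals above: sampling the replay set evenly across taxonomy branches keeps any one skill or knowledge area from dominating a training run.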