BJ Hargrave - Open Source Community Instruction-tuning of Large Language Models | PyData Vermont 2024
Learn how InstructLab enables open-source community instruction tuning of LLMs using teacher-student models, synthetic data generation, and a structured taxonomy approach.
-
InstructLab is a new open-source project by IBM and Red Hat that enables community-driven instruction tuning of Large Language Models (LLMs).
-
The project uses a “teacher-student” model approach:
- Large teacher models generate synthetic training data
- Smaller student models learn from this data
- This reduces costs while maintaining capabilities
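The teacher-student loop above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: `teacher_generate` would really call a large teacher LLM, and `student_finetune` would really run a training loop on the smaller model.

```python
# Sketch of the teacher-student approach. Both functions are toy
# stand-ins for real LLM calls and real fine-tuning.

def teacher_generate(seed_examples, n=3):
    """Pretend-teacher: derive new Q/A pairs by varying seed examples."""
    synthetic = []
    for i in range(n):
        q, a = seed_examples[i % len(seed_examples)]
        synthetic.append((f"Variant {i}: {q}", a))
    return synthetic

def student_finetune(dataset):
    """Pretend-student: 'learns' by memorizing question->answer pairs."""
    return dict(dataset)

seeds = [("What is InstructLab?", "A community project for tuning LLMs.")]
synthetic_data = teacher_generate(seeds, n=3)
student = student_finetune(seeds + synthetic_data)
```

The point of the structure is the cost asymmetry: the expensive teacher runs only at data-generation time, while the cheap student is what gets trained and deployed.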
-
Contributions are organized in a taxonomy with two main types:
- Knowledge recipes (facts and information)
- Compositional skill recipes (how to perform tasks)
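The two contribution types can be pictured as structured records. The field names below only loosely echo the project's taxonomy files and are illustrative, not the authoritative schema; the paths are hypothetical.

```python
# Illustrative shapes for the two taxonomy contribution types.
# Field names and paths are assumptions, not the project's exact schema.

knowledge_recipe = {
    "type": "knowledge",                      # facts and information
    "path": "knowledge/science/astronomy",    # hypothetical taxonomy path
    "seed_examples": [
        {"question": "What is a pulsar?",
         "answer": "A rapidly rotating neutron star that emits beams of radiation."},
    ],
}

skill_recipe = {
    "type": "compositional_skill",            # how to perform a task
    "path": "compositional_skills/writing/summarization",  # hypothetical
    "seed_examples": [
        {"question": "Summarize: The cat sat on the mat all afternoon.",
         "answer": "A cat spent the afternoon on a mat."},
    ],
}
```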
-
The synthetic data generation process includes:
- Question generation by teacher model
- Answer generation based on provided knowledge
- Quality evaluation of outputs
- Fine-tuning the student model with validated data
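The four stages above chain naturally into a small pipeline. All of the functions here are hypothetical stand-ins: in a real run, question and answer generation are teacher-model calls and the quality check is a model- or rule-based evaluator.

```python
# Sketch of the synthetic data pipeline: generate questions, answer them
# from the provided knowledge, filter by quality, keep validated pairs.
# All three functions are toy placeholders for teacher-model calls.

def generate_questions(knowledge, n=4):
    return [f"Q{i}: What does the contribution say about this topic?"
            for i in range(n)]

def generate_answer(question, knowledge):
    # A real teacher would answer *grounded in* the provided knowledge.
    return f"According to the source: {knowledge}"

def passes_quality_check(question, answer):
    # Toy filter: require a non-empty answer that cites the source.
    return bool(answer) and "source" in answer

knowledge = "InstructLab organizes contributions in a taxonomy."
candidates = [(q, generate_answer(q, knowledge))
              for q in generate_questions(knowledge)]
validated = [qa for qa in candidates if passes_quality_check(*qa)]
# `validated` is the dataset used to fine-tune the student model.
```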
-
Key advantages of the approach:
- Allows community contributions similar to open source software
- Addresses the problem of limited new training data
- Enables training on private/enterprise data
- More cost-effective than using large models like ChatGPT
-
Technical implementation details:
- Runs on Apple Silicon Macs
- Uses 4-bit quantized models
- Supports GGUF format
- Available via pip install
- Compatible with models like Mistral and Llama
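As a small illustration of the GGUF container format mentioned above (this check is not part of InstructLab itself): a GGUF file begins with the 4-byte magic `b"GGUF"` followed by a little-endian uint32 format version, which is enough to sanity-check a downloaded model file.

```python
# Minimal GGUF header check. GGUF files (the quantized-model container
# used by llama.cpp-family tooling) start with magic b"GGUF" and then a
# little-endian uint32 version number.
import struct
import tempfile

def gguf_version(path):
    """Return the GGUF version number, or None if the file is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    (version,) = struct.unpack("<I", header[4:8])
    return version

# Demo on a synthetic header, not a real model file:
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as f:
    f.write(b"GGUF" + struct.pack("<I", 3))
    demo_path = f.name

version = gguf_version(demo_path)
```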
-
The project manages training quality through:
- Replay buffers to prevent forgetting
- Even distribution of training across taxonomy
- Evaluation of synthetic data quality
- Balance between different skills and knowledge areas
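The replay-buffer idea above can be sketched simply: each fine-tuning round mixes the new synthetic data with a sample of earlier data, so previously learned skills are rehearsed rather than overwritten. The `replay_fraction` knob is an assumption for illustration, not a documented InstructLab parameter.

```python
# Sketch of replay-based batch construction to mitigate catastrophic
# forgetting: train on new data plus a random sample of older data.
import random

def build_training_batch(new_data, replay_buffer, replay_fraction=0.5, rng=None):
    """Mix new examples with replayed old ones; fraction is relative to new data."""
    rng = rng or random.Random(0)          # seeded for reproducibility
    n_replay = int(len(new_data) * replay_fraction)
    replayed = rng.sample(replay_buffer, min(n_replay, len(replay_buffer)))
    return new_data + replayed

old = [f"old-{i}" for i in range(10)]      # earlier taxonomy areas
new = [f"new-{i}" for i in range(6)]       # freshly generated data
batch = build_training_batch(new, old, replay_fraction=0.5)
```

The same sampling hook is a natural place to enforce even coverage across taxonomy branches, by stratifying the replay sample rather than drawing it uniformly.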