Fine-tuning large models on local hardware — Benjamin Bossan

Learn how to efficiently fine-tune large language models on local hardware using LoRA and other memory-reduction techniques. Practical tips and implementation guidance.

Key takeaways
  • LoRA (Low-Rank Adaptation) is a popular parameter-efficient fine-tuning method that significantly reduces the memory required to train large models (a minimal PEFT sketch follows this list)

  • Key memory reduction techniques:

    • Using lower precision (float16/int8/int4)
    • Combining LoRA with quantization (QLoRA; see the sketch after this list)
    • Only calculating gradients for trainable parameters (~1% of total)
    • Freezing base model weights
  • Training large models with full fine-tuning typically requires roughly 4x the model size in memory (a rough estimate follows this list) due to:

    • Base model weights
    • Gradients
    • Optimizer states
    • Activations
  • Practical tips for implementation:

    • Start small with quick end-to-end test runs
    • Target linear layers by default
    • Try prompting techniques first before fine-tuning
    • Use higher learning rates with LoRA than with full fine-tuning
    • Merge the LoRA weights into the base model after training (see the adapter-handling sketch after this list)
  • Common misconceptions:

    • LoRA doesn’t make inference faster or reduce inference memory
    • Adding LoRA actually increases the total parameter count but reduces training memory
    • Full fine-tuning still provides best performance if resources allow
  • Advanced features:

    • Multiple LoRA adapters can be trained for different tasks
    • Adapters can be merged or disabled as needed
    • Compatible with distributed training (DDP, DeepSpeed, FSDP)
    • Supports torch.compile for optimization
  • Real example: training memory for a Llama 3 8B model reduced from 56GB (full fine-tuning) to 15GB using these techniques
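
The following is a minimal sketch of applying LoRA with the PEFT library, as referenced in the list above. The model name and the rank/alpha/dropout values are illustrative assumptions, not settings prescribed by the talk.

```python
# Minimal LoRA sketch with PEFT (model name and hyperparameters are assumptions).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                         # rank of the low-rank update matrices
    lora_alpha=32,                # scaling factor applied to the LoRA update
    target_modules="all-linear",  # target linear layers by default
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
# The base weights are frozen; only the LoRA parameters (roughly 1% of the
# total) require gradients and optimizer states.
model.print_trainable_parameters()
```

Because only the adapter parameters are trainable, gradients and optimizer states are kept just for them, which is where most of the training-memory savings come from.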
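A rough QLoRA sketch: the frozen base model is loaded in 4-bit precision via bitsandbytes and LoRA adapters are trained on top of it. Again, the model name and configuration values are assumptions for illustration.

```python
# QLoRA sketch: 4-bit quantized base model + LoRA adapters (values are assumptions).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
)
base_model = prepare_model_for_kbit_training(base_model)

model = get_peft_model(
    base_model,
    LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM"),
)
```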
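One possible way to handle adapters after training: merging the LoRA weights into the base model, loading several adapters for different tasks, and temporarily disabling them. Paths and adapter names below are hypothetical placeholders.

```python
# Post-training adapter handling sketch (paths and adapter names are placeholders).
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Load a trained LoRA adapter on top of the frozen base model.
model = PeftModel.from_pretrained(base_model, "path/to/lora-adapter", adapter_name="task_a")

# Load a second adapter and switch between tasks.
model.load_adapter("path/to/another-adapter", adapter_name="task_b")
model.set_adapter("task_b")

# Temporarily disable all adapters to recover the base model's behavior.
with model.disable_adapter():
    pass  # the model behaves like the plain base model inside this block

# Merge the active LoRA weights into the base model so inference carries no
# extra adapter overhead.
merged_model = model.merge_and_unload()
```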
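Finally, a back-of-the-envelope version of the "4x the model size" rule and the Llama 3 8B example. This ignores activations, caches, and framework overhead, so it will not reproduce the measured 56GB and 15GB figures exactly; it only shows where the savings come from.

```python
# Rough memory estimate: full fine-tuning vs. 4-bit base model + LoRA.
# Activations and framework overhead are ignored, so the measured figures
# from the talk (56GB -> 15GB) will differ from these numbers.
params = 8e9  # Llama 3 8B

# Full fine-tuning in float16: weights + gradients + 2 Adam states, 2 bytes each.
full_ft_gb = params * 2 * 4 / 1e9           # ~64 GB before activations

# LoRA on a 4-bit base model: quantized weights plus ~1% trainable parameters.
base_4bit_gb = params * 0.5 / 1e9           # ~4 GB of quantized weights
lora_params = 0.01 * params                 # ~1% of parameters are trainable
lora_train_gb = lora_params * 2 * 4 / 1e9   # adapter weights, grads, optimizer states

print(f"full fine-tuning: ~{full_ft_gb:.0f} GB")
print(f"4-bit + LoRA:     ~{base_4bit_gb + lora_train_gb:.0f} GB before activations")
```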