Fine-tuning large models on local hardware — Benjamin Bossan

Learn how to efficiently fine-tune large language models on local hardware using LoRA and other memory-reduction techniques. Practical tips and implementation guidance.

Key takeaways
  • LoRA (Low-Rank Adaptation) is a popular parameter-efficient fine-tuning method that significantly reduces the memory required to train large models (a minimal PEFT sketch follows this list)

  • Key memory reduction techniques:

    • Using lower precision (float16/int8/int4)
    • Combining LoRA with quantization (QLoRA; see the sketch after this list)
    • Only calculating gradients for trainable parameters (~1% of total)
    • Freezing base model weights
  • Training large models with full fine-tuning typically requires roughly 4x the model size in memory (a rough estimate follows this list) due to:

    • Base model weights
    • Gradients
    • Optimizer states
    • Activations
  • Practical tips for implementation:

    • Start small with quick end-to-end test runs
    • Target linear layers by default
    • Try prompting techniques first before fine-tuning
    • Use higher learning rates with LoRA than with full fine-tuning
    • Merge the LoRA weights into the base model after training (see the adapter-handling sketch after this list)
  • Common misconceptions:

    • LoRA doesn’t make inference faster or reduce inference memory
    • Adding LoRA actually increases the total parameter count but reduces training memory
    • Full fine-tuning still provides best performance if resources allow
  • Advanced features:

    • Multiple LoRA adapters can be trained for different tasks
    • Adapters can be merged or disabled as needed
    • Compatible with distributed training (DDP, DeepSpeed, FSDP)
    • Supports torch.compile for optimization
  • Real example: training memory for a Llama 3 8B model reduced from 56GB (full fine-tuning) to 15GB using these techniques
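
The following is a minimal sketch of applying LoRA with the PEFT library, as referenced in the list above. The model name and the rank/alpha/dropout values are illustrative assumptions, not settings prescribed by the talk.

```python
# Minimal LoRA sketch with PEFT (model name and hyperparameters are assumptions).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                         # rank of the low-rank update matrices
    lora_alpha=32,                # scaling factor applied to the LoRA update
    target_modules="all-linear",  # target linear layers by default
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
# The base weights are frozen; only the LoRA parameters (roughly 1% of the
# total) require gradients and optimizer states.
model.print_trainable_parameters()
```

Because only the adapter parameters are trainable, gradients and optimizer states are kept just for them, which is where most of the training-memory savings come from.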
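A rough QLoRA sketch: the frozen base model is loaded in 4-bit precision via bitsandbytes and LoRA adapters are trained on top of it. Again, the model name and configuration values are assumptions for illustration.

```python
# QLoRA sketch: 4-bit quantized base model + LoRA adapters (values are assumptions).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
)
base_model = prepare_model_for_kbit_training(base_model)

model = get_peft_model(
    base_model,
    LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM"),
)
```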
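One possible way to handle adapters after training: merging the LoRA weights into the base model, loading several adapters for different tasks, and temporarily disabling them. Paths and adapter names below are hypothetical placeholders.

```python
# Post-training adapter handling sketch (paths and adapter names are placeholders).
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Load a trained LoRA adapter on top of the frozen base model.
model = PeftModel.from_pretrained(base_model, "path/to/lora-adapter", adapter_name="task_a")

# Load a second adapter and switch between tasks.
model.load_adapter("path/to/another-adapter", adapter_name="task_b")
model.set_adapter("task_b")

# Temporarily disable all adapters to recover the base model's behavior.
with model.disable_adapter():
    pass  # the model behaves like the plain base model inside this block

# Merge the active LoRA weights into the base model so inference carries no
# extra adapter overhead.
merged_model = model.merge_and_unload()
```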
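Finally, a back-of-the-envelope version of the "4x the model size" rule and the Llama 3 8B example. This ignores activations, caches, and framework overhead, so it will not reproduce the measured 56GB and 15GB figures exactly; it only shows where the savings come from.

```python
# Rough memory estimate: full fine-tuning vs. 4-bit base model + LoRA.
# Activations and framework overhead are ignored, so the measured figures
# from the talk (56GB -> 15GB) will differ from these numbers.
params = 8e9  # Llama 3 8B

# Full fine-tuning in float16: weights + gradients + 2 Adam states, 2 bytes each.
full_ft_gb = params * 2 * 4 / 1e9           # ~64 GB before activations

# LoRA on a 4-bit base model: quantized weights plus ~1% trainable parameters.
base_4bit_gb = params * 0.5 / 1e9           # ~4 GB of quantized weights
lora_params = 0.01 * params                 # ~1% of parameters are trainable
lora_train_gb = lora_params * 2 * 4 / 1e9   # adapter weights, grads, optimizer states

print(f"full fine-tuning: ~{full_ft_gb:.0f} GB")
print(f"4-bit + LoRA:     ~{base_4bit_gb + lora_train_gb:.0f} GB before activations")
```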