Low-Rank Adaptation, LoRA, is a mathematical technique proposed by Microsoft researchers as an efficient and cheaper alternative to full fine-tuning. LoRA freezes the pre-trained model weights and injects trainable low-rank matrices into layers of a transformer. It reduces the number of trainable parameters, hence the computational burden. Researchers found that LoRA could fine-tune GPT-3 175B with 10,000 times fewer trainable parameters and three times fewer GPU memory requirements vs. alternative methods.

Why LoRA?

Adapting large language models to specific domains, tasks, or datasets can be resource-intensive, requiring powerful GPUs and thousands, if not millions of dollars. However, most cases do not require the whole model, i.e., every parameter, to be modified. LoRA is built on this assumption. It allows developers to control the number of parameters to train and merge Low-Rank weights with the original one to minimize latency.

Advantages of LoRA:

LoRA finds a subset to train to achieve similar results as the full fine-tune. If that subset is 1%, LoRA can fine-tune a model with 100x less resources. This results in:

  1. Reduced training time: LoRA reduces the time required for fine-tuning from weeks to days, if not hours, as it requires less computing power to complete the task.
  2. Accessibility: Less computing power also means less powerful GPUs, including consumer GPUs. They are easier and cheaper to acquire, making fine-tuning accessible for more developers.
  3. No additional inference time: LoRA merges trainable matrices with the frozen weights when deployed, resulting in no additional inference latency.
  4. Easier Task Switching: LoRA simplifies task switching, as developers can easily create multiple models with different specializations and change LoRA weights instead of all parameters.
  5. Cost Savings: LoRA minimizes the required hardware for fine-tuning and hosting independent instances for different tasks, resulting in cost savings.

LoRA has cons, too. Since LoRA fine-tunes the model for a specific task, data, or domain, the model has to be adjusted when something different is introduced.

Effectiveness of LoRA

Researchers noted that their evaluations had proven LoRA offers comparable or better performance compared to the other efficient fine-tuning methods and full fine-tuning. It’s important to note that LoRA was published in 2021. In May 2023, University of Washington researchers introduced QLoRA , a more efficient variant of LoRA that first quantizes a model, hence Q in the name.

Every enterprise focuses on its competitive edge. Researching how to train, fine-tune, and deploy AI models requires different expertise. Hence, leading or even following the latest advances in AI is challenging for those who use language models to stay competitive. If you’re unsure how to train, fine-tune, and deploy large language models, engage with Picovoice Consulting and choose what works best for your project!

Consult an Expert