
Last updated: May 23, 2024

Revolutionizing AI fine-tuning: A closer look at LoRA


Gadi Bessudo

As a Solution Architect specializing in LLM Guardrails, I design and implement safeguards to ensure the safe and responsible use of AI technologies, focusing on system integrity and user trust.

11 min read May 08, 2024

Large Language Models (LLMs) like GPT-4, Llama-2, and Claude have shown tremendous capabilities for solving language tasks. However, out of the box they are often unreliable on domain-specific downstream tasks, such as portfolio optimization or drug discovery. A model trained for summarization may not adapt well to machine reading comprehension, question answering, or coding tasks. Moreover, pre-trained LLMs can exhibit issues like bias and hallucinations.

This is why researchers fine-tune LLMs for specific tasks. Fine-tuned models show improved reasoning capabilities and fewer hallucinations on their target domains. For instance, Med-PaLM and BloombergGPT are LLMs fine-tuned for the medical and finance domains, and Meta’s Llama-2 has multiple fine-tuned variants for coding and question-answering tasks.

The trade-off between LLM quality and efficiency

However, fine-tuning is a prohibitively expensive process. It usually requires retraining all of the model’s parameters, known as full fine-tuning. Once retrained, a full copy of the parameters must be stored for every task, which creates a serious storage and deployment challenge.

To mitigate fine-tuning issues, researchers have developed techniques like adding adapter layers in the LLM or applying prefix tuning to improve the operational efficiency of the fine-tuning process. However, these approaches may introduce further issues, including:

  • Increased inference latency
  • Reduced usable input sequence length
  • Failure to match the full fine-tuning baselines

As a result, there is a trade-off between an LLM’s model quality and efficiency.

Introducing LoRA

Low-Rank Adaptation, or LoRA, is a groundbreaking solution that addresses these problems. It lets you adapt an LLM to downstream tasks while keeping compute, memory, and storage requirements low, without adding to the model’s inference latency.

Note: You might have already read quite a bit online about LoRA, and you might have thought to yourself something like “Oh, so LoRA is simply an adapter, added on top of an LLM!”.

Well, that’s not exactly the case.

There seems to be some confusion online on what LoRA is exactly, and how it works. LoRA is not an adapter.

So what is an adapter? What is LoRA? And how are the two concepts related?

Let’s find out more about LoRA in detail below.

What is LoRA?

Introduced by Microsoft in 2021, LoRA is a parameter-efficient fine-tuning (PEFT) technique used to adapt general-purpose large language models to task-specific domains that the model may not have seen during pre-training. It does so by training only a small set of additional parameters, an approach known as low-dimension reparameterization.

An illustration of LoRA architecture

LoRA is inspired by a 2020 Meta research paper, Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning, which empirically shows that pre-trained models have a low intrinsic dimension. In other words, a model can be fine-tuned by updating only a small, low-dimensional set of parameters and still achieve performance similar to full fine-tuning. In this sense, LoRA behaves as a constrained form of full fine-tuning, with the weight update restricted to a low-rank matrix.
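In equation form (following the notation of the original LoRA paper), the reparameterized forward pass of a dense layer looks like this:

```latex
% Frozen pre-trained weights W_0 plus a trainable low-rank update \Delta W = BA
h = W_0 x + \Delta W x = W_0 x + B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

Only B and A are trained while W_0 stays frozen, so the number of trainable parameters for that layer drops from d × k to r × (d + k).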

As a result of low-dimension reparameterization, LoRA-based fine-tuning provides many advantages for downstream task adaptation, including:

  • Storage efficiency: Because only the small low-rank matrices need to be saved per task while the pre-trained weights stay frozen, LoRA significantly reduces the storage requirement, leading to a much smaller checkpoint. For instance, LoRA drastically reduces the GPT-3 per-task checkpoint size from roughly 350 GB to 35 MB.
  • Easier domain adaptation: Since each task only needs a small set of LoRA weights, many task-specific modules can be kept in memory and swapped on the fly instead of loading full model copies from disk. For instance, LoRA reduces the number of trainable parameters for GPT-3 from 175B to roughly 4.7M and still gives comparable performance (see the sketch after this list). As a result, adapting an LLM to any number of downstream tasks becomes much easier, reducing the task-switching overhead.
  • Hardware efficiency: LoRA lowers the hardware barrier during fine-tuning. Compared to pre-training, where hundreds of GPUs are required to train large models, LoRA can fine-tune a model with far lower GPU memory requirements. For instance, LoRA can fine-tune a GPT-3 scale model with roughly three times lower GPU memory requirements than full fine-tuning.
  • No inference latency: The fine-tuned weights obtained using LoRA can be merged with the pre-trained model weights using simple matrix addition (explained below). Hence, there is no additional inference latency when the fine-tuned model is deployed.
  • Compatible with other fine-tuning methods: LoRA can be combined with prior methods like prefix tuning and adapters (discussed below) to give better or comparable results.
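To make the parameter savings concrete, here is a minimal sketch of a LoRA setup using the Hugging Face PEFT library. The base model, rank, and target modules below are placeholder choices for illustration, not recommendations from the LoRA paper:

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT (illustrative values only).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; any causal LM from the Hub works the same way.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the frozen base model with trainable LoRA modules.
model = get_peft_model(base_model, lora_config)

# Prints trainable vs. total parameters; only a tiny fraction is trainable.
model.print_trainable_parameters()
```

From here, the wrapped model can be trained with any standard training loop; only the LoRA matrices receive gradient updates, and saving the adapter stores just those small matrices.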

Now, let’s take a deep dive into the technical understanding of how LoRA operates, what low rank and adaptation mean in LoRA, and how it updates trainable parameters.

The inner workings of LoRA

Consider a common pre-trained LLM built on top of the transformer architecture containing multi-head attention and multi-layer perceptron (MLP) layers. With LoRA, we freeze all the pre-trained model weights and introduce a small set of weights into each dense layer of the transformer. 

During full fine-tuning, the dense layers’ full-rank weight matrices are updated directly: we modify all pre-trained model weights and calculate their respective gradients based on the domain-specific dataset.

In comparison, LoRA learns an optimized low-rank decomposition of each weight update. That is, LoRA trains two low-dimensional matrices, whose shared dimension is set by the rank hyperparameter, and multiplies them to obtain the update for the fine-tuned weight matrix.

To refresh your matrix theory: the rank of a matrix is the number of its linearly independent rows or columns, i.e., rows or columns that cannot be expressed as combinations of the other rows or columns of the matrix. For example, in a 3×3 identity matrix, all rows (and columns) are linearly independent, so the rank of that matrix is 3. How does this concept translate into LoRA?
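Before answering, here is a quick, purely illustrative NumPy check of the rank idea:

```python
import numpy as np

# The 3x3 identity matrix has three linearly independent rows/columns, so its rank is 3.
identity = np.eye(3)
print(np.linalg.matrix_rank(identity))  # 3

# An outer product of two vectors is rank 1: every row is a multiple of the same vector.
low_rank = np.outer([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(np.linalg.matrix_rank(low_rank))  # 1
```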

As mentioned above, we have two low-dimensional matrices in LoRA. Initially, one matrix is initialized from a normal distribution and the other is initialized to zero, so the update starts out as zero. Then, based on the fine-tuning objective, backpropagation finds the right values for the two matrices. Their product, which has the same shape as the original pre-trained weight matrix, is the fine-tuned weight update.

The final weights are calculated by adding this update to the pre-trained weights, and the model is then ready to run inference on the domain-specific task.

An illustration of low-rank matrix decomposition in LoRA
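To make the mechanics concrete, here is a minimal, self-contained PyTorch sketch of a LoRA-adapted linear layer. The class name, initialization scale, and alpha/rank scaling convention are simplifications for illustration, not the reference implementation:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False  # pre-trained weights stay frozen

        # A is initialized from a normal distribution, B starts at zero,
        # so the update B @ A contributes nothing at the start of fine-tuning.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the low-rank update path.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

    def merge(self) -> None:
        # Fold the update into the frozen weights: W <- W + scaling * (B @ A).
        # After merging, inference is a single matmul, so there is no extra latency.
        with torch.no_grad():
            self.base.weight += self.scaling * (self.lora_B @ self.lora_A)


# Tiny usage example with arbitrary shapes.
layer = LoRALinear(in_features=64, out_features=64, r=4)
with torch.no_grad():
    layer.lora_B.normal_()  # pretend fine-tuning has produced a non-zero update

x = torch.randn(2, 64)
out_adapted = layer(x)      # frozen weights + low-rank update path
layer.merge()
out_merged = layer.base(x)  # single matmul after merging the update

print(torch.allclose(out_adapted, out_merged, atol=1e-5))  # True
```

The merge step is exactly the “simple matrix addition” mentioned earlier: once the low-rank update is folded back into the pre-trained weights, the adapted model has the same shape and latency as the original.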

LoRA is different from adapters

Introduced in 2019, adapters are another popular LLM fine-tuning technique that adds only a few trainable parameters per downstream task. They inject new lightweight modules or layers between the layers of the original pre-trained model: for every multi-head attention and MLP sub-block in the transformer architecture, an adapter layer is added and its weights are updated according to the downstream task.

Overview of Adapter architecture. Source

Adapters drastically improve fine-tuning efficiency by adding only a small number of trainable parameters per downstream task. But here’s the catch: adapter layers add extra depth that must be processed sequentially and cannot be folded back into the original weights, which increases the inference latency of fine-tuned models, especially at small batch sizes.

Unlike adapters, LoRA adds no inference latency, since its update can be merged into the existing weights with simple matrix operations. Moreover, in the original setup it adapts only the attention weight matrices for the downstream task and freezes the rest of the transformer weights, making it highly parameter-efficient.

Combining LoRA with other fine-tuning techniques

LoRA can be combined with practically any other fine-tuning method, such as prefix tuning or adapters, to achieve specific objectives, like adapting your language translation or sentiment analysis models to specialized domains at a fraction of the storage and memory cost. This is possible because the trainable parameters introduced by these techniques can be treated as part of LoRA’s trainable parameters.

For instance, in prefix tuning, trainable prefix tokens are prepended to the input sequence to steer the model toward the task. These prefix embeddings are themselves trainable parameters, so LoRA-style adaptation can be applied to them for the downstream task.

Similarly, in adapters, additional layers are added to the attention and MLP sub-blocks. The weights of these additional layers can be updated using LoRA to reduce the sequential processing overhead of adapters.

The benefits of LoRA

LoRA’s researchers ran several experiments to test its fine-tuning performance against other parameter-efficient and full fine-tuning approaches. The experiments fine-tuned RoBERTa, GPT-2, and GPT-3 models using multiple adapter variations, prefix tuning, prompt tuning, bias-vector tuning, and full fine-tuning, evaluated on benchmarks and metrics such as MNLI, BLEU, ROUGE, and CIDEr.

The table below illustrates LoRA’s fine-tuning capabilities on the GPT-3 175B model for several benchmarks. It either outperforms or gives comparable outcomes to other fine-tuning techniques while using a fraction of trainable parameters.

Performance comparison of different fine-tuning approaches with LoRA. Source

Overall, LoRA holds an unparalleled edge over other methods due to:

  • Computational efficiency with storage and memory optimization.
  • Flexibility in adapting LLMs to new tasks and domains.
  • Making state-of-the-art models more financially viable for small-scale AI labs and individual researchers.

More use cases with LoRA

With LoRA, stakeholders across critical domains are better equipped to adapt large foundation models to downstream tasks and make a significant impact. For example:

Healthcare 

LLMs have made significant strides in patient care, medical education, and research. They can converse with patients, analyze doctors’ notes, summarize literature, and propose treatment plans. LoRA-adapted LLMs are better equipped to handle healthcare data such as medical literature, research findings, clinical notes, prescriptions, and lab results. Researchers can quickly fine-tune specialized models to power clinical decision support systems, accelerate drug development, and build better patient engagement platforms.

Autonomous vehicles

LLMs are ‘driving’ a lot of innovation in the autonomous vehicles domain. With LoRA, researchers can quickly build models that can interpret complex traffic scenarios, generate driving scene simulations via natural language commands, assist drivers in adapting to autonomous driving policies in new locations, and provide accident analysis and prevention strategies.

Personalized education

LoRA-powered LLMs can help develop specialized learning tools and tailored study materials across subjects and class levels. Educators and students can leverage such LLMs to enhance productivity and make learning more interactive. Moreover, LoRA can quickly help build multilingual LLMs to support a diverse student population in the classroom.

Addressing fine-tuning challenges

Despite LoRA’s transformative impact across domains, some fine-tuning challenges still need to be addressed, mainly AI hallucinations, profanity, and off-topic responses.

While a standardized fine-tuning task and sufficient training data can help reduce these challenges, there is no guarantee. This can adversely affect your LLM’s trustworthiness.

One solution is to use LoRA in conjunction with other fine-tuning techniques like adapters or prefix tuning. However, configuring the parameters for these techniques adds another challenge to the already complex fine-tuning pipeline.

LoRA and Aporia

A simpler approach is to integrate Aporia Guardrails with your LLM applications. It adds a security and protection middleware layer on top of your LLM that checks the integrity of its responses and corrects them in real time, effectively mitigating hallucinations, profanity, and off-topic responses.

Solutions like Aporia Guardrails and fine-tuning techniques such as LoRA help address many of these challenges, and practitioners should consider adopting them in their LLM pipelines. But LLM researchers are still mapping out all the ways an LLM can go wrong, so future problems will likely be solved by a variety of innovative tools and techniques. Practitioners must actively try out new techniques to decide which ones suit their requirements.

Conclusion

Let’s recap! Since its release, LoRA has had a transformative impact on the AI landscape. It has made training and fine-tuning language models more efficient, accessible, and adaptable.

What do we get with LoRA?

  • LoRA is an alternative to full fine-tuning. It is not an adapter layer.
  • It is a parameter-efficient fine-tuning technique designed to reduce the number of trainable parameters.
  • Instead of adding extra layers to the transformer architecture, LoRA handles fine-tuning via low-rank matrix multiplication and addition. As a result, models incur no additional inference latency.
  • By changing the rank hyperparameter, you can control the trade-off between an LLM’s overfitting and underfitting.
  • LoRA drastically reduces the model checkpoint size and minimizes GPU memory requirements, saving thousands of dollars on hardware costs.
  • LoRA makes it easier to adapt one foundation pre-trained model into multiple fine-tuned models for various downstream tasks.
  • Aporia Guardrails and LoRA together help ensure that your GenAI app performs reliably and according to your use case’s needs.

How are you using LoRA in your domain? Share your insights with us.

Secure and protect the integrity of your GenAI apps with Aporia Guardrails. 

Get a Demo Today!

