LoRA Explained: Parameter-efficient fine-tuning

Redefining the Rules of AI by Empowering Efficient Fine-Tuning for Large Language Models and Diffusion Models

John Lu
5 min read · Jun 2, 2024

Introduction

Large Language Models (LLMs) are powerful tools for various language processing tasks. They start by learning from a vast amount of text data, picking up on common patterns and word associations. Once they have this foundational knowledge, they can be tailored to specific tasks like determining the sentiment behind a text.

The challenge is that LLMs are huge, with billions of settings (parameters) that could be adjusted. But when we’re fine-tuning them for a specific task, we often don’t need to tweak every single one of these settings — especially since the data we use for fine-tuning is usually much smaller than the data used for initial training. It’s like having a giant toolbox when you only need a few tools.

Low-Rank Adaptation (LoRA) is a technique that streamlines the fine-tuning process. It smartly picks a smaller set of parameters to adjust, cutting down on the time and computer power needed, without sacrificing the performance of the model.

In this blog post, we’ll dive into the technical details of how LoRA works and why it’s a game-changer for making LLMs more accessible and efficient.

But what is LoRA?

Think of LoRA as a smart way to update Large Language Models (LLMs) without having to retrain the whole thing. It’s like giving the model a quick and focused “refresher course” instead of relearning everything from scratch.

Here’s a simpler breakdown:

  • Normally, an LLM has a massive matrix (let’s call it W0) full of numbers that it uses to make decisions. This matrix is huge — imagine a grid that’s n rows by n columns.
  • With LoRA, instead of changing this big matrix, we create two smaller matrices, A and B. These are much narrower (their width is called the rank), and their product B × A has exactly the same shape as the big matrix, so it can serve as an update to it.
  • The rank is a lot smaller than n, making these new matrices far cheaper to work with. In practice, a rank as small as 1 to 4 often does the trick.
  • By adjusting these smaller matrices, we can effectively update the LLM’s decision-making process without the heavy lifting of retraining the entire model.

So, LoRA is a clever shortcut that keeps the LLM’s knowledge up-to-date and ready for new tasks, all while saving a lot of time and computing resources. It’s like fine-tuning your car’s engine rather than building a new one from scratch.
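To make those shapes concrete, here is a minimal sketch in PyTorch. The hidden size of 768 and the rank of 4 are assumptions chosen purely for illustration:

```python
import torch

n = 768   # size of the frozen weight matrix (an assumed example value)
r = 4     # LoRA rank, much smaller than n

W0 = torch.randn(n, n)        # the big frozen matrix, n x n
A = torch.randn(r, n) * 0.01  # small matrix A: r rows, n columns
B = torch.zeros(n, r)         # small matrix B: n rows, r columns (starts at zero)

delta_W = B @ A               # their product has exactly the same shape as W0
print(delta_W.shape)          # torch.Size([768, 768])
```

Because B starts at zero, the update B × A is zero at the beginning of fine-tuning, so the model behaves exactly like the original until training nudges A and B away from their starting values.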

LoRA equation and its concept

In the original setup, the output of a layer in the model is calculated using the formula:

output = W0x + b0

Here, x is the input you give to the model, W0 is a big grid of numbers (the weight matrix), and b0 is a set of adjustments (the bias terms). These are part of the model’s original, unchangeable layers.

Now, LoRA changes the game by adding a special twist to this formula:

output = W0x + b0 + BAx

In this new formula, A and B are smaller, more manageable grids of numbers (the rank-decomposition matrices). They’re like a mini-update pack for the model.

The idea behind LoRA is that we don’t need to overhaul the entire model to update it. Since the model originally had more parameters than necessary (it was over-parametrized), we can just tweak a small part of it. This is done by focusing on these smaller matrices, A and B, which represent a “low rank” update. It means we’re making significant changes without the need to adjust the massive W0 matrix.

By doing this, we can achieve results similar to retraining the whole model (full fine-tuning), but with much less effort and computing power. It’s a bit like updating just the parts of a car engine that need it, rather than building a new engine entirely. This makes fine-tuning faster and more efficient, while still keeping the model’s performance high.

(Source: Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models”, https://arxiv.org/abs/2106.09685)
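As a rough sketch of how that modified formula could look in code (this is illustrative, not the paper’s reference implementation; the class name LoRALinear and the rank of 4 are assumptions for the example):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative linear layer with a LoRA update: output = W0 x + b0 + B A x."""

    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        # Frozen pretrained weight W0 and bias b0
        self.linear = nn.Linear(in_features, out_features)
        for p in self.linear.parameters():
            p.requires_grad = False
        # Trainable low-rank matrices: A is (rank x in), B is (out x rank)
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))  # zero init, so BA = 0 at start

    def forward(self, x):
        # Frozen path W0 x + b0, plus the low-rank update B A x
        return self.linear(x) + x @ self.A.T @ self.B.T


layer = LoRALinear(768, 768, rank=4)
output = layer(torch.randn(2, 768))   # a batch of 2 input vectors
```

Only A and B receive gradients during training; the original weight and bias stay frozen, which is exactly the “mini-update pack” described above.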

Number of trainable parameters

Let’s break down the math for a clearer understanding:

1. Original Dense Layer W0:

  • The original dense layer W0 has a grid of numbers with dimensions 768 x 768.
  • So, the total number of parameters in W0 is 768 x 768 = 589,824.

2. LoRA Layers (A and B):

  • LoRA introduces two smaller matrices, A and B.
  • A and B have dimensions 768 x 4 and 4 x 768, respectively.
  • The combined parameters in A and B are 768 x 4 + 4 x 768 = 6,144.

3. Comparison:

  • When using LoRA, we freeze the massive W0 and train only the smaller A and B matrices.
  • So, instead of updating 589,824 trainable parameters, we now have only 6,144 trainable parameters for this dense layer.

In summary, LoRA significantly reduces the number of parameters we need to fine-tune, making the process more efficient while maintaining performance! 🚀
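A quick sanity check of those numbers in plain Python, using the same 768 x 768 layer and rank 4 assumed above:

```python
n, rank = 768, 4

w0_params = n * n                    # 589,824 parameters in the frozen W0
lora_params = n * rank + rank * n    # 6,144 parameters in A and B combined

print(w0_params, lora_params)        # 589824 6144
print(w0_params // lora_params)      # 96 -> roughly 96x fewer trainable parameters
```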

Why does LoRA reduce the memory footprint?

Imagine your computer’s memory as a backpack you’re packing for a trip. You have four main types of items to pack:

1. Model Memory (Model Weights):

  • This is like the clothes you need to pack. With LoRA, you might have a few more clothes than usual (since LoRA adds a bit more to the model), but not by much.

2. Forward Pass Memory (Running the Model):

  • Think of this as the snacks you’ll eat along the way. It doesn’t matter whether you’re taking the new route (LoRA) or the old one (full fine-tuning of the base model, e.g. GPT-2); you’ll need the same amount of snacks.

3. Backward Pass Memory (Storing Gradients):

  • These are like the souvenirs you’ll collect. But you’ll only collect them for the new sights you see (trainable parameters). Since LoRA has fewer new sights, you’ll collect fewer souvenirs, saving space.

4. Optimizer Memory (Storing Optimizer State):

  • This is like the guidebooks for each place you’ll visit. The Adam optimizer needs one for each sight. But with fewer new sights in LoRA, you need fewer guidebooks.

Even though you’re adding a few new items (LoRA layers), you’re actually saving space because you’re bringing fewer guidebooks and souvenirs (trainable parameters). That’s how LoRA manages to reduce the overall memory needed, even though it seems like you’re packing more. It’s all about packing smarter, not more! 🎒
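To put rough numbers on the backpack analogy, here is a back-of-the-envelope sketch for the same 768 x 768 layer, assuming 32-bit floats and the Adam optimizer (which keeps two extra values per trainable parameter). Forward-pass activations are left out since, as noted above, they’re roughly the same either way; the figures are illustrative, not measured.

```python
BYTES_PER_FLOAT32 = 4
n, rank = 768, 4

frozen_params = n * n        # W0 is stored in both scenarios
lora_params = 2 * n * rank   # A and B

def training_memory_bytes(trainable, frozen=0):
    weights   = (trainable + frozen) * BYTES_PER_FLOAT32   # model memory
    gradients = trainable * BYTES_PER_FLOAT32              # backward pass memory
    adam      = 2 * trainable * BYTES_PER_FLOAT32          # optimizer state (m and v)
    return weights + gradients + adam

full_ft = training_memory_bytes(trainable=frozen_params)
lora_ft = training_memory_bytes(trainable=lora_params, frozen=frozen_params)

print(f"full fine-tuning: {full_ft / 1e6:.1f} MB for this one layer")  # ~9.4 MB
print(f"LoRA fine-tuning: {lora_ft / 1e6:.1f} MB for this one layer")  # ~2.5 MB
```

Most of the savings come from the gradient and optimizer entries, which scale with the number of trainable parameters rather than with the size of the frozen model.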

Why has LoRA gained popularity?

1. Reduced GPU Memory Usage:

  • When training models, GPU memory is like the workspace where the magic happens. LoRA is like a tidy desk — it uses less space. By focusing on just the essential parts of the model, it saves GPU memory. So, you can fit more models or bigger tasks into the same workspace.

2. Faster Training:

  • Imagine you’re learning a new skill. Instead of starting from scratch, you build on what you already know. LoRA does the same for models. It fine-tunes them faster because it doesn’t waste time relearning everything. It’s like leveling up without redoing the whole game.

3. No Extra Inference Latency:

  • Inference is when the model makes predictions (like answering questions). LoRA doesn’t slow this process down at all, because once training is finished the small update BA can be folded back into the original weights W0, so the model does exactly the same amount of work per prediction as before. It’s like having a well-practiced musician — no extra time needed to play the right notes. So, your AI stays snappy and responsive.
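A minimal sketch of that merge step (the weights are random here purely for illustration; in practice W0 comes from the pretrained model and A, B from fine-tuning):

```python
import torch

n, rank = 768, 4
W0 = torch.randn(n, n)             # frozen pretrained weight
A = torch.randn(rank, n) * 0.01    # trained LoRA matrices
B = torch.randn(n, rank) * 0.01

# Fold the low-rank update into the original weight once, after training:
W_merged = W0 + B @ A              # same shape, same inference cost as W0

x = torch.randn(2, n)
# The merged weight gives the same output as keeping the extra BAx term:
assert torch.allclose(x @ W_merged.T, x @ W0.T + x @ A.T @ B.T, atol=1e-4)
```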

Conclusions

In summary, LoRA is like a smart, efficient assistant that gets the job done without clutter or delays. No wonder it’s catching everyone’s attention! 🌟

John Lu

AI Engineer. Deeply motivated by challenges and excited by breaking conventional ways of thinking and doing. He builds fun and creative apps.