LoRA: Reducing Trainable Parameters

LoRA: 0 to 100

Alright, we've ascertained that performing full fine-tuning on an LLM involves updating a massive number of parameters and requires a lot of compute... more compute than most of us have access to or are willing to pay for. So that's a non-starter.

But what if there was a way to reduce the number of parameters that need to be updated at each training step (we call these trainable parameters)? What if we could take a model with 7 billion parameters and change its behavior by fine-tuning only 30 million parameters? This is precisely what LoRA aims to accomplish.

Brief Background

Believe it or not, the idea of reducing the number of trainable parameters in a model is not new. For many years, we've employed techniques like pruning to strip out unused parameters from the model and parameter sharing to group parameters together.

In the world of LLM fine-tuning, it is not uncommon to see techniques like layer freezing, where we only update parameters in the final few layers of the model, and prefix tuning, where we train special "prefix" tokens that are prepended to the actual input tokens and guide the model's predictions.

However, the common denominator between these techniques is that they don't modify the model holistically. This is a significant limitation, as the specific circuitry that is most relevant to our fine-tuning task might be located anywhere within the model and typically involves a complex interplay between many layers.

LoRA offers an alternative — using clever linear algebra tricks, we can actually update an entire model with a fraction of the parameters. Crazy, right? To understand how this is possible, it is necessary to first understand LoRA's mathematical underpinning: low-rank factorization.

🔎 The next section is quite math-heavy. If you're new to linear algebra, take your time to understand the concepts and pull in additional resources. It's worth it!

Low-Rank Factorization

Matrix Rank

The rank of a matrix, put simply, is a measure of its information density (this is a crude analogy so don't get mad at me). If a matrix has a rank of 1, it means that all of the information in the matrix can be encoded in a single vector. If the matrix has a very high rank, it means that it encodes a lot more information. Take the following matrix for example:

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 6 & 9 \end{bmatrix}$$

This matrix has a rank of 1 since the second and third columns are scalar multiples of the first column! The second column is just the first multiplied by 2, and the third column is just the first multiplied by 3. This effectively means that all the information in the matrix can be encoded in a single vector equal to the first column: $[1, 2, 3]$. Now consider a second matrix:

$$B = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ 2 & 4 & 6 \end{bmatrix}$$

What do you think the rank of this matrix is? It's 2! Notice that it is impossible to create the first row using a combination of the second and third rows. Put formally, there is no linear combination of the rows that can form the first row, meaning the first row is linearly independent of the other two. On the other hand, the second and third rows are linearly dependent since the third row is just the second row multiplied by 2. We can therefore encode the matrix's information using two vectors, $[1, 1, 1]$ and $[1, 2, 3]$, giving us a rank of 2.

This brings us to an actual definition: the rank of a matrix is the maximum number of linearly independent rows or columns in the matrix. If the matrix has a rank of $r$, it means that its contents can be expressed using only $r$ linearly independent vectors, as we did above.

We say that a matrix $A$ is full-rank if it has the maximum possible rank. In other words, the rank satisfies $\text{rank}(A) = \min(m, n)$, where $m$ is the number of rows and $n$ is the number of columns. In this case, every row or column is linearly independent of the others, and the matrix encodes the maximum amount of information.

Conversely, we call a matrix low-rank (or rank-deficient) if it has a rank significantly less than the maximum possible rank: $\text{rank}(A) \ll \min(m, n)$. For example, the first matrix we looked at would be considered low-rank since its rank of 1 is less than its maximum possible rank of 3!
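
If you want to verify these ranks yourself, here's a quick NumPy sketch (using `np.linalg.matrix_rank`, which estimates rank numerically via the SVD):

```python
import numpy as np

# The two example matrices from above
A = np.array([[1, 2, 3],
              [2, 4, 6],
              [3, 6, 9]])

B = np.array([[1, 1, 1],
              [1, 2, 3],
              [2, 4, 6]])

print(np.linalg.matrix_rank(A))  # 1 -- every column is a multiple of the first
print(np.linalg.matrix_rank(B))  # 2 -- only two linearly independent rows
print(min(A.shape))              # 3 -- the maximum possible rank, so both matrices are low-rank
```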

Properties of Matrix Rank

Very quickly before moving on, there are two key properties of matrix rank that we need to know. We will lean on these properties later to understand exactly how LoRA works:

  1. The rank of a matrix $A$ is constrained by the minimum of its number of rows $m$ and columns $n$. So if you have a matrix with dimensions $(3, 2)$, its maximum possible rank would be 2. This is fairly self-evident from the definition of a full-rank matrix, but it's worth repeating:
$$\text{rank}(A) \leq \min(m, n)$$
  2. For two matrices $A$ and $B$, the rank of their product $AB$ is constrained by their individual ranks. Intuitively, when we combine two matrices, the resulting matrix will encode only as much information as the least informative of the two matrices:
$$\text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B))$$
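
As a quick sanity check, here's a small NumPy sketch that demonstrates both properties on randomly generated matrices (the specific shapes are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Property 1: rank(A) <= min(m, n)
A = rng.standard_normal((3, 2))
print(np.linalg.matrix_rank(A), "<=", min(A.shape))  # 2 <= 2

# Property 2: rank(AB) <= min(rank(A), rank(B))
A = rng.standard_normal((100, 5))  # rank 5 (with probability ~1)
B = rng.standard_normal((5, 100))  # rank 5 (with probability ~1)
AB = A @ B                         # a (100, 100) matrix...
print(np.linalg.matrix_rank(AB))   # ...whose rank is still only 5
```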

Rank Factorization

Now for the fun part. The rank factorization of a matrix $A$ is a way to factor it into two smaller matrices, $B$ and $C$. Specifically, if $A$ has rank $r$ and dimensions $(m, n)$, then there always exists a matrix $B$ with dimensions $(m, r)$ and a matrix $C$ with dimensions $(r, n)$ such that the following matrix product holds:

$$A = BC$$

Using this rule, the rank factorization of our first matrix $A$ with dimensions $(3, 3)$ and rank 1 would result in a $(3, 1)$ matrix $B$ and a $(1, 3)$ matrix $C$:

A=[123246369]=[123][123]A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 6 & 9 \\ \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ \end{bmatrix}

That was a lot at once, so let's break it down step by step using a concrete example. Imagine we have a matrix $X$ with dimensions $(100, 1000)$ that happens to have a matrix rank $r_X$ of 10. From the first property that we learned above, we know that the maximum possible rank of $X$ is $\min(m, n)$. In this case, this corresponds to the number of rows, or 100. This means that $X$ is very low-rank ($10 \ll 100$), encoding only a fraction of the information that it theoretically could.

So let's factorize $X$ into two smaller matrices $Y$ and $Z$ such that $X = YZ$. This is typically done using an algorithm called singular value decomposition (SVD), but that's a topic for another day. The key point is that $Y$ will have dimensions $(m, r_X) = (100, 10)$ and $Z$ will have dimensions $(r_X, n) = (10, 1000)$, meaning that the number of elements has decreased from 100,000 in $X$ to 11,000 in $YZ$. Therefore, we have successfully represented the same amount of information as in $X$ but with only 11% of the original data volume (number of matrix elements)!

Note that $Y$ and $Z$ are always full-rank matrices. From our first property, we know that their maximum possible rank is $r_X$ due to their dimensions. But from the second property, we know that the ranks of $Y$ and $Z$ set a ceiling on the rank of their product $YZ$. Since we know $YZ$'s rank is $r_X$ (because it is equal to $X$), both $Y$ and $Z$ must have a rank of at least $r_X$. Putting both properties together, $Y$ and $Z$ have a rank of exactly $r_X$ and, since this equals their smaller dimension, they are full-rank.

Let's quickly intuit why all this is even possible. The $X$ matrix is very low-rank — it encodes far less information than it theoretically could at its size. This suggests that there are many repeated patterns in the data (like our first low-rank matrix from the previous section) that we can factor out using clever algorithms like SVD. In other words, we replace the noisy $X$ matrix with two smaller, high-signal matrices. We call this technique of rank factorization on low-rank matrices... low-rank factorization. We'll return to this momentarily.
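
Here's what that concrete example could look like in code. This is a minimal sketch: the matrix X is deliberately constructed to have rank 10 (by multiplying two random rank-10 factors), and the factorization is read straight off NumPy's SVD:

```python
import numpy as np

rng = np.random.default_rng(0)

# Construct a (100, 1000) matrix X that genuinely has rank 10
X = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 1000))
r = np.linalg.matrix_rank(X)      # 10

# Rank factorization via SVD: X = U @ diag(S) @ Vt, then keep the top r components
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Y = U[:, :r] * S[:r]              # (100, 10) -- singular values folded into Y
Z = Vt[:r, :]                     # (10, 1000)

print(np.allclose(X, Y @ Z))      # True: the same information...
print(X.size, Y.size + Z.size)    # 100000 vs 11000: ...stored in 11% of the elements
```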

Parameter Matrices

Now back to the world of deep learning. Before we can tie this all together, we need to understand how the parameters of a neural network (and therefore LLMs) are typically represented. If you guessed "as matrices", you're exactly right.

🔎 You're about to witness the world's fastest crash course on neural networks. This detour probably doesn't do justice to the complexity behind neural networks, so if you get lost, feel free to skip to the takeaway in the last paragraph.

The image below depicts a traditional 3-layer neural network. The input layer has 3 neurons, the hidden layer has 6 neurons, and the output layer has 1 neuron. The input layer doesn't actually perform any computation, it just describes the dimensionality (or number of features) of the input data. The second and third layers are where the magic happens: we call these linear (or dense or fully connected) layers, and they're responsible for performing the actual computations.

Neural Network

In a linear layer, each neuron is connected to every neuron in the preceding layer, using a simple linear equation. Consider the first neuron in the second layer (the first green circle). It takes the three values from the input layer, $x_1$, $x_2$, and $x_3$, and multiplies each by its corresponding weight, $w_1$, $w_2$, and $w_3$. Here's what this looks like, where $b$ is an additional trainable parameter called the bias and $y$ is the output of that first neuron:

$$y = w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 + b$$
🔎 You might be confused about the interplay between the terms "parameters" and "weights". Typically, parameters refer to any trainable value in the model, which includes both weights and biases. In practice, however, the two terms are often used interchangeably.

Guess what! We can represent this equation using a vector dot product, where $\vec{w}$ is a vector of the weights and $\vec{x}$ is a vector of the inputs:

$$y = \vec{w} \cdot \vec{x} + b$$
$$\vec{w} = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} \quad \vec{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$$

So far, we've only modeled the output of a single neuron in the second layer. Can we instead model the outputs of all 6 neurons in the second layer? Yes, using matrix multiplications:

$$Y = XW + b$$
  1. Our input vector $\vec{x}$ becomes a $(1, 3)$ input matrix $X$. If we wanted to pass multiple inputs to the layer at once, $X$ would become a $(B, 3)$ input matrix, where $B$ is the batch size.
  2. Our weight vector $\vec{w}$ becomes the first column of our $(3, 6)$ weight matrix $W$, which contains the weights of all 6 neurons in the second layer. Each neuron has its own column. The dots in the matrix represent the weights of the other neurons.
  3. Our bias $b$ becomes the first column of a $(1, 6)$ bias matrix $b$. Each neuron has a single bias. Similarly to the weight matrix, the dots in the matrix represent the biases of the other neurons.
  4. The output of the second layer is then a $(1, 6)$ output matrix $Y$: each column contains the output of a single neuron. Notice that the first column of $Y$ contains the same output that we derived above for the first neuron. If we passed multiple inputs to the layer at once, $Y$ would become a $(B, 6)$ output matrix.
$$Y = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} \begin{bmatrix} w_1 & \dots \\ w_2 & \dots \\ w_3 & \dots \end{bmatrix} + \begin{bmatrix} b & \dots \end{bmatrix}$$
$$Y = \begin{bmatrix} w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 + b & \dots \end{bmatrix}$$

Congratulations — if you've read this far, we've just devised a clean formula to represent the outputs of an entire linear layer! Following this approach, the third layer would be represented as a $(6, 1)$ weight matrix $W_2$ and a $(1, 1)$ bias matrix $b_2$. In practice, we would then wrap this output in a non-linear activation function like ReLU or tanh before passing it to the next layer:

$$Y = \tanh(XW + b)$$
🔎 See if you can figure out the pattern in the dimensions of these matrices. Hint: the first dimension of $W$ is the number of neurons in the previous layer and the second dimension is the number of neurons in its own layer.

The primary takeaway here is that we can represent each layer as a set of parameter matrices. In our simple 3-layer network, the parameters are primarily contained in the weight matrices $W_1$ and $W_2$. And this is also true for LLMs! All the parameters of LLMs, for both attention and feedforward layers, are represented as parameter matrices.
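
To see this takeaway in code, here's a short PyTorch sketch of the second layer of our toy network. One caveat: `nn.Linear` stores its weight matrix transposed, as `(out_features, in_features)`, so its forward pass computes `X @ W.T + b` with that stored matrix, but the idea is identical:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Second layer of the toy network: 3 input features -> 6 neurons
layer = nn.Linear(in_features=3, out_features=6)
print(layer.weight.shape)  # torch.Size([6, 3]) -- one row of weights per neuron
print(layer.bias.shape)    # torch.Size([6])    -- one bias per neuron

X = torch.randn(1, 3)      # a single input, i.e. a (1, 3) matrix

# The layer's output is exactly the matrix formula from above (up to the transpose)
Y_manual = X @ layer.weight.T + layer.bias
Y_layer = layer(X)
print(torch.allclose(Y_manual, Y_layer))  # True
print(Y_layer.shape)                      # torch.Size([1, 6])
```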

Putting LoRA Together

Key Insight

Consider an arbitrary parameter matrix $W$ of an LLM that we're fine-tuning. Since full fine-tuning will update every parameter in the model, we can represent the final fine-tuned $W'$ matrix as the element-wise sum of our original $W$ matrix and a new matrix $\Delta W$ that represents the updates for each parameter. In this reframing, $W$ is frozen and $\Delta W$ contains the deltas from all the newly fine-tuned parameters:

$$W' = W + \Delta W$$

The key insight behind LoRA is that the $\Delta W$ matrix is low-rank. In other words, $\Delta W$ has a low information density. The fine-tuning process isn't actually leveraging the flexibility of all the available trainable parameters and is instead re-learning the same patterns over and over again in $\Delta W$. Here's how Hu et al. explained this finding in the original LoRA paper:

We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach.

Why does this make sense? During pre-training, we assume that $W$ will be maximally expressive (i.e. a full-rank matrix) to properly capture the complexity of language from a massive dataset. However, when we fine-tune the pre-trained model on a new task, not all of the model's parameters need to be updated — the complexity of a single task pales in comparison to that of learning an entire language via pre-training. In fact, only a fraction of the parameters in $\Delta W$ will include meaningful updates.

Similar to our example at the end of the rank factorization section, we can therefore factorize the contents of the $\Delta W$ matrix into two smaller matrices $M$ and $N$ that capture the same information with fewer parameters:

$$W' = W + \Delta W = W + MN$$

If we plug this into the linear layer equation from our neural network earlier, we get an updated version of the layer output equation that uses LoRA:

$$Y = XW + b$$
$$Y = X(W + \Delta W) + b$$
$$Y = X(W + MN) + b$$

Now, instead of fine-tuning all the parameters in $\Delta W$, we only need to fine-tune those of $M$ and $N$! We call this low-rank adaptation (LoRA) since we're performing model "adaptation" (i.e. fine-tuning) using low-rank matrices.

$$Y = XW + XMN + b$$

Note also that we can distribute $X$ due to the distributive property of matrix multiplication, as seen above. This is an important feature of LoRA since it means that we don't have to modify the pre-trained model (in $W$) at all. We can store the LoRA matrices separately from the base model and simply apply them during the forward pass. We'll see why this is important later.
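
To make this concrete, here's a minimal sketch of what a LoRA-wrapped linear layer could look like in PyTorch. This is a simplified illustration rather than a reference implementation (libraries like Hugging Face's peft add details such as a scaling factor and dropout), and the class and attribute names here are my own:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update MN."""

    def __init__(self, base_layer: nn.Linear, rank: int):
        super().__init__()
        self.base = base_layer
        for param in self.base.parameters():
            param.requires_grad_(False)  # freeze W (and b) -- they are never updated

        m = base_layer.in_features
        n = base_layer.out_features

        # M: (m, r) and N: (r, n) -- the only trainable parameters.
        # N starts at zero so MN = 0 and the wrapped layer initially behaves like the base layer.
        self.M = nn.Parameter(torch.randn(m, rank) * 0.01)
        self.N = nn.Parameter(torch.zeros(rank, n))

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # Y = XW + b (frozen base) + XMN (low-rank update), i.e. X distributed over W + MN
        return self.base(X) + X @ self.M @ self.N


# Wrap a "pre-trained" 4096 -> 4096 projection with a rank-8 adapter
base = nn.Linear(4096, 4096)
lora = LoRALinear(base, rank=8)

trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora.parameters())
print(f"{trainable:,} trainable / {total:,} total")  # 65,536 trainable / 16,846,848 total
```

Because M and N live alongside the untouched base layer, the adapter can be saved, shared, or swapped independently of the pre-trained weights, which is exactly the property described above.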

Choosing the Dimensions

You might now be asking, "how do we pick the dimensions of $M$ and $N$?" The answer is that there's a trade-off. Let's try to understand where this trade-off comes from.

Let $\Delta W_{\text{opt}}$ be the matrix, with rank $r_{\text{opt}}$, that optimally learns the task under full fine-tuning conditions. Similarly, let $\Delta W_{\text{LoRA}}$ be the matrix that we factorize into $M$ and $N$ under LoRA fine-tuning conditions. We give $M$ and $N$ the dimensions $(m, r_{\text{LoRA}})$ and $(r_{\text{LoRA}}, n)$ respectively, where $r_{\text{LoRA}}$ is the rank that we choose for our $M$ and $N$ matrices.

From our first and second matrix rank properties, we know that the product $\Delta W_{\text{LoRA}} = MN$ will have a rank of at most $r_{\text{LoRA}}$. In other words, $MN$ can never encode more information than a rank-$r_{\text{LoRA}}$ matrix allows. Therefore, if $r_{\text{LoRA}} < r_{\text{opt}}$, the rank of $\Delta W_{\text{LoRA}}$ will be less than the rank of $\Delta W_{\text{opt}}$. This means that $M$ and $N$ won't have the size (and therefore parameter count) to sufficiently capture the information that full fine-tuning optimally learns.

Conversely, if $r_{\text{LoRA}} > r_{\text{opt}}$, we will actually have more parameters than necessary. In other words, the full fine-tuned $\Delta W_{\text{opt}}$ matrix would possess a rank less than $\Delta W_{\text{LoRA}}$, so we would be using more representation power in $M$ and $N$ than the task requires.

LoRA Trade-Off

We therefore face a trade-off. If you care about significantly reducing the number of trainable parameters (i.e. parameter efficiency), you might choose a low $r_{\text{LoRA}}$ at the expense of compromising the model's ability to learn. If you care about preserving the model's learning capacity, you might choose a high $r_{\text{LoRA}}$ at the expense of the computational burden of training more parameters.

In practice, we want to aim for $r_{\text{LoRA}} \approx r_{\text{opt}}$. To get there, you will need to pick a point on the trade-off curve that is relevant to your use case and then iteratively adjust $r_{\text{LoRA}}$ based on the fine-tuned model's performance.
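
To get a feel for the parameter-efficiency side of this trade-off, here's a small sketch that counts the trainable parameters of a single $(4096, 4096)$ weight matrix at a few candidate ranks (4096 is just an illustrative size, roughly that of an attention projection in a 7B-parameter model):

```python
# Trainable parameters for one (m, n) weight matrix: full fine-tuning vs. LoRA
m, n = 4096, 4096
full = m * n  # 16,777,216 parameters in delta-W under full fine-tuning

for r in (1, 4, 8, 16, 64, 256):
    lora = m * r + r * n  # elements of M (m, r) plus N (r, n)
    print(f"r={r:>3}: {lora:>9,} trainable params ({lora / full:.2%} of full fine-tuning)")
```

Even at a relatively generous rank like 64, the low-rank update costs only a few percent of the parameters of the full update, which is where LoRA's parameter savings come from.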