LoRA: 0 to 100
Alright, we've ascertained that performing full fine-tuning on an LLM involves updating a massive number of parameters and requires a lot of compute... more compute than most of us have access to or are willing to pay for. So that's a non-starter.
But what if there were a way to reduce the number of parameters that need to be updated at each training step (we call these trainable parameters)? What if we could take a model with 7 billion parameters and change its behavior by fine-tuning only 30 million parameters? This is precisely what LoRA aims to accomplish.
Brief Background
Believe it or not, the idea of reducing the number of trainable parameters in a model is not new. For many years, we've employed techniques like pruning to strip out unused parameters from the model and parameter sharing to group parameters together.
In the world of LLM fine-tuning, it is not uncommon to see techniques like layer freezing, where we only update parameters in the final few layers of the model, and prefix tuning, where we train special "prefix" tokens that are prepended to the actual input tokens and guide the model's predictions.
However, the common denominator between these techniques is that they don't modify the model holistically. This is a significant limitation, as the specific circuitry that is most relevant to our fine-tuning task might be located anywhere within the model and typically involves a complex interplay between many layers.
LoRA offers an alternative — using clever linear algebra tricks, we can actually update an entire model with a fraction of the parameters. Crazy, right? To understand how this is possible, it is necessary to first understand LoRA's mathematical underpinning: low-rank factorization.
Low-Rank Factorization
Matrix Rank
The rank of a matrix, put simply, is a measure of its information density (this is a crude analogy so don't get mad at me). If a matrix has a rank of 1, it means that all of the information in the matrix can be encoded in a single vector. If the matrix has a very high rank, it means that it encodes a lot more information. Take the following matrix for example:

$$\begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 6 & 9 \end{bmatrix}$$

This matrix has a rank of 1 since the second and third columns are scalar multiples of the first column! The second column is just the first multiplied by 2, and the third column is just the first multiplied by 3. This effectively means that all the information in the matrix can be encoded in a single vector equal to the first column: $\begin{bmatrix} 1 & 2 & 3 \end{bmatrix}^\top$. Now consider a second matrix:

$$\begin{bmatrix} 1 & 0 & 2 \\ 1 & 2 & 3 \\ 2 & 4 & 6 \end{bmatrix}$$
What do you think the rank of this matrix is? It's 2! Notice that it is impossible to create the first row using a combination of the second and third rows. Put formally, there is no linear combination of the rows that can form the first row, meaning the first row is linearly independent of the other two. On the other hand, the second and third rows are linearly dependent since the third row is just the second row multiplied by 2. We can therefore encode the matrix's information using two vectors, $\begin{bmatrix} 1 & 0 & 2 \end{bmatrix}$ and $\begin{bmatrix} 1 & 2 & 3 \end{bmatrix}$, giving us a rank of 2.
This brings us to an actual definition: the rank of a matrix is the maximum number of linearly independent rows or columns in the matrix. If the matrix has a rank of $r$, it means that its contents can be expressed using only $r$ linearly independent vectors, as we did above.
We say that a matrix is full-rank if it has the maximum possible rank. In other words, the rank satisfies $\text{rank} = \min(m, n)$, where $m$ is the number of rows and $n$ is the number of columns. In this case, every row or column is linearly independent of the others, and the matrix encodes the maximum amount of information.
Conversely, we call a matrix low-rank (or rank-deficient) if it has a rank significantly less than the maximum possible rank: $\text{rank} \ll \min(m, n)$. For example, the first matrix we looked at would be considered low-rank since its rank of 1 is less than its maximum possible rank of 3!
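If you want to check these ranks yourself, NumPy (an arbitrary choice on my part; any linear algebra library works) can compute them directly with `np.linalg.matrix_rank`:

```python
import numpy as np

# The rank-1 matrix: every column is a scalar multiple of the first column.
A = np.array([
    [1, 2, 3],
    [2, 4, 6],
    [3, 6, 9],
])

# The rank-2 matrix: the third row is 2x the second row,
# but the first row is linearly independent of the other two.
B = np.array([
    [1, 0, 2],
    [1, 2, 3],
    [2, 4, 6],
])

print(np.linalg.matrix_rank(A))  # 1
print(np.linalg.matrix_rank(B))  # 2
```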
Properties of Matrix Rank
Very quickly before moving on, there are two key properties of matrix rank that we need to know. We will lean on these properties later to understand exactly how LoRA works:
- The rank of a matrix is constrained by the minimum of its number of rows $m$ and columns $n$. So if you have a matrix with dimensions $2 \times 5$, its maximum possible rank would be 2. This is fairly self-evident from the definition of a full-rank matrix, but it's worth repeating: $\text{rank} \le \min(m, n)$
- For two matrices $M$ and $N$, the rank of their product $MN$ is constrained by their individual ranks. Intuitively, when we combine two matrices, the resulting matrix will encode only as much information as the least informative of the two matrices: $\text{rank}(MN) \le \min(\text{rank}(M), \text{rank}(N))$
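Both properties are easy to sanity-check numerically. Here's a small sketch that builds two low-rank matrices from thin random factors and confirms that the rank of their product never exceeds either individual rank:

```python
import numpy as np

rng = np.random.default_rng(0)

# Property 1: a 2 x 5 matrix can have rank at most min(2, 5) = 2.
A = rng.standard_normal((2, 5))
print(np.linalg.matrix_rank(A))  # 2

# Property 2: build M (rank 3) and N (rank 2) by multiplying thin random factors.
M = rng.standard_normal((10, 3)) @ rng.standard_normal((3, 10))   # rank 3
N = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 10))   # rank 2

# The product's rank is capped by the least informative factor.
print(np.linalg.matrix_rank(M), np.linalg.matrix_rank(N))  # 3 2
print(np.linalg.matrix_rank(M @ N))                        # <= 2
```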
Rank Factorization
Now for the fun part. The rank factorization of a matrix $A$ is a way to factor it into two smaller matrices, $M$ and $N$. Specifically, if $A$ has rank $r$ and dimensions $m \times n$, then there always exists a matrix $M$ with dimensions $m \times r$ and a matrix $N$ with dimensions $r \times n$ such that the following matrix product holds:

$$A = MN$$
Using this rule, the rank factorization of our first matrix with dimensions $3 \times 3$ and rank 1 would result in a $3 \times 1$ matrix $M$ and a $1 \times 3$ matrix $N$:

$$\begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 6 & 9 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}$$
That was a lot at once, so let's break it down step by step using a concrete example. Imagine we have a matrix $A$ with dimensions $100 \times 1000$ that happens to have a matrix rank of 10. From the first property that we learned above, we know that the maximum possible rank of $A$ is $\min(m, n)$. In this case, this corresponds to the number of rows, or 100. This means that $A$ is very low-rank ($10 \ll 100$), encoding only a fraction of the information that it theoretically could.
So let's factorize $A$ into two smaller matrices $M$ and $N$ such that $A = MN$. This is typically done using an algorithm called singular value decomposition (SVD), but that's a topic for another day. The key point is that $M$ will have dimensions $100 \times 10$ and $N$ will have dimensions $10 \times 1000$, meaning that the number of elements has decreased from 100,000 in $A$ to 11,000 in $M$ and $N$ combined. Therefore, we have successfully represented the same amount of information as in $A$ but with only 11% of the original data volume (number of matrix elements)!
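As a rough illustration of how SVD gets used here (the recipe below is my own sketch, not something we'll rely on later), we can build a random $100 \times 1000$ matrix of rank 10, split it with `np.linalg.svd`, and confirm that the two thin factors reproduce it exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Construct a 100 x 1000 matrix that is exactly rank 10.
A = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 1000))

# Thin SVD: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 10
M = U[:, :r] * s[:r]   # 100 x 10  (absorb the singular values into M)
N = Vt[:r, :]          # 10 x 1000

print(np.allclose(A, M @ N))    # True: same information...
print(A.size, M.size + N.size)  # 100000 11000  ...with only 11% of the elements
```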
Note that $M$ and $N$ are always full-rank matrices. From our first property, we know that their maximum possible rank is $r$ due to their dimensions. But from the second property, we know that the ranks of $M$ and $N$ set a ceiling on the rank of their product $MN$. Since we know $MN$'s rank is $r$ (because it is equal to $A$), this means that both $M$ and $N$ must have a minimum rank of $r$. Putting both properties together, $M$ and $N$ assume a rank of exactly $r$ and, since this equals their smaller dimension, they are full-rank.
Let's quickly intuit why all this is even possible. The matrix $A$ is very low-rank — it encodes far less information than it theoretically could at its size. This suggests that there are many repeated patterns in the data (like our first low-rank matrix from the previous section) that we can factor out using clever algorithms like SVD. In other words, we replace the noisy matrix with two smaller, high-signal matrices. We call this technique of rank factorization on low-rank matrices... low-rank factorization. We'll return to this momentarily.
Parameter Matrices
Now back to the world of deep learning. Before we can tie this all together, we need to understand how the parameters of a neural network (and therefore LLMs) are typically represented. If you guessed "as matrices", you're exactly right.
The image below depicts a traditional 3-layer neural network. The input layer has 3 neurons, the hidden layer has 6 neurons, and the output layer has 1 neuron. The input layer doesn't actually perform any computation; it just describes the dimensionality (or number of features) of the input data. The second and third layers are where the magic happens: we call these linear (or dense or fully connected) layers, and they're responsible for performing the actual computations.

In a linear layer, each neuron is connected to every neuron in the preceding layer, using a simple linear equation. Consider the first neuron in the second layer (the first green circle). It takes the three values from the input layer, $x_1$, $x_2$, and $x_3$, and multiplies each by its corresponding weight, $w_1$, $w_2$, and $w_3$. Here's what this looks like, where $b$ is an additional trainable parameter called the bias and $a$ is the output of that first neuron:

$$a = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$$
Guess what! We can represent this equation using a vector dot product, where $w$ is a vector of the weights and $x$ is a vector of the inputs:

$$a = w \cdot x + b$$
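In code, that single neuron boils down to a one-line dot product. A tiny sketch with made-up numbers:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])   # the three input values
w = np.array([0.1, 0.4, -0.2])   # the neuron's three weights
b = 0.3                          # the neuron's bias

a = np.dot(w, x) + b             # output of the first hidden neuron
print(a)                         # -0.45 (up to float rounding)
```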
So far, we've only modeled the output of a single neuron in the second layer. Can we instead model the outputs of all 6 neurons in the second layer? Yes, using matrix multiplications:
- Our input vector $x$ becomes a $1 \times 3$ input matrix $X$. If we wanted to pass multiple inputs to the layer at once, $X$ would become an $s \times 3$ input matrix, where $s$ is the batch size.
- Our weight vector $w$ becomes the first column of our $3 \times 6$ weight matrix $W_1$, which contains the weights of all 6 neurons in the second layer. Each neuron has its own column. The dots in the matrix represent the weights of the other neurons.
- Our bias $b$ becomes the first column of a $1 \times 6$ bias matrix $B_1$. Each neuron has a single bias. Similarly to the weights matrix, the dots in the matrix represent the biases of the other neurons.
- The output of the second layer is then a $1 \times 6$ output matrix $Y_1$: each column contains the output of a single neuron. Notice that the first column of $Y_1$ contains the same output that we derived above for the first neuron. If we passed multiple inputs to the layer at once, $Y_1$ would become an $s \times 6$ output matrix.

$$Y_1 = X W_1 + B_1 = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} \begin{bmatrix} w_1 & \cdot & \cdot & \cdot & \cdot & \cdot \\ w_2 & \cdot & \cdot & \cdot & \cdot & \cdot \\ w_3 & \cdot & \cdot & \cdot & \cdot & \cdot \end{bmatrix} + \begin{bmatrix} b & \cdot & \cdot & \cdot & \cdot & \cdot \end{bmatrix}$$
Congratulations — if you've read this far, we've just devised a clean formula to represent the outputs of an entire linear layer! Following this approach, the third layer would be represented as a $6 \times 1$ weight matrix $W_2$ and a $1 \times 1$ bias matrix $B_2$. In practice, we would then wrap this output in a non-linear activation function $\sigma$ like ReLU or tanh before passing it to the next layer:

$$Y_1 = \sigma(X W_1 + B_1)$$
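To tie the pieces together, here's a sketch of a forward pass through this toy network with explicit parameter matrices (randomly initialized, since the actual values don't matter for seeing the shapes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer 2 (hidden): 3 inputs -> 6 neurons. One column of weights per neuron.
W1 = rng.standard_normal((3, 6))   # weight matrix
B1 = rng.standard_normal((1, 6))   # bias matrix (one bias per neuron)

# Layer 3 (output): 6 inputs -> 1 neuron.
W2 = rng.standard_normal((6, 1))
B2 = rng.standard_normal((1, 1))

X = rng.standard_normal((1, 3))    # a single input with 3 features

Y1 = np.tanh(X @ W1 + B1)          # hidden layer output, shape (1, 6)
Y2 = Y1 @ W2 + B2                  # network output, shape (1, 1)

print(Y1.shape, Y2.shape)          # (1, 6) (1, 1)
```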
The primary takeaway here is that we can represent each layer as a set of parameter matrices. In our simple 3-layer network, the parameters are primarily contained in the weight matrices $W_1$ and $W_2$. And this is also true for LLMs! All the parameters of LLMs, for both attention and feedforward layers, are represented as parameter matrices.
Putting LoRA Together
Key Insight
Consider an arbitrary parameter matrix $W$ of an LLM that we're fine-tuning. Since full fine-tuning will update every parameter in the model, we can represent the final fine-tuned matrix $W'$ as the element-wise sum of our original matrix $W$ and a new matrix $\Delta W$ that represents the updates for each parameter. In this reframing, $W$ is frozen and $\Delta W$ contains the deltas from all the newly fine-tuned parameters:

$$W' = W + \Delta W$$
The key insight behind LoRA is that the matrix $\Delta W$ is low-rank. In other words, $\Delta W$ has a low information density. The fine-tuning process isn't actually leveraging the flexibility of all the available trainable parameters and is instead re-learning the same patterns over and over again in $\Delta W$. Here's how Hu et al. explained this finding in the original LoRA paper:
We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach.
Why does this make sense? During pre-training, we assume that $W$ will be maximally expressive (i.e. a full-rank matrix) to properly capture the complexity of language from a massive dataset. However, when we fine-tune the pre-trained model on a new task, not all of the model's parameters need to be updated — the complexity of a single task pales in comparison to that of learning an entire language via pre-training. In fact, only a fraction of the parameters in $\Delta W$ will include meaningful updates.
Similar to our example at the end of the rank factorization section, we can therefore factorize the contents of the matrix $\Delta W$ into two smaller matrices $M$ and $N$ that capture the same information with fewer parameters:

$$\Delta W = MN$$
If we plug this into the layer output equation from our neural network earlier, we get an updated version of the equation that uses LoRA:

$$Y = X(W + \Delta W) + B = X(W + MN) + B = XW + XMN + B$$
Now, instead of fine-tuning all the parameters in $\Delta W$, we only need to fine-tune those of $M$ and $N$! We call this low-rank adaptation (LoRA) since we're performing model "adaptation" (i.e. fine-tuning) using low-rank matrices.
Note also that we can distribute $X$ due to the distributive property of matrix multiplication, as seen above. This is an important feature of LoRA since it means that we don't have to modify the pre-trained model (in $W$) at all. We can store the LoRA matrices separately from the base model and then simply apply the LoRA matrices during the forward pass. We'll see why this is important later.
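Here's what that looks like in code: a minimal sketch of a LoRA-wrapped linear layer in PyTorch (my own illustrative version, not the implementation you'd find in a library like Hugging Face's `peft`). The pre-trained weights stay frozen, only $M$ and $N$ are trained, and their product is added in during the forward pass:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update M @ N."""

    def __init__(self, base: nn.Linear, r: int):
        super().__init__()
        self.base = base
        # Freeze the pre-trained weights and bias (the W and B above).
        for p in self.base.parameters():
            p.requires_grad = False
        # Trainable low-rank factors: M is (in_features x r), N is (r x out_features).
        self.M = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.N = nn.Parameter(torch.zeros(r, base.out_features))

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # XW + B (frozen) plus the low-rank update X @ M @ N (trainable).
        return self.base(X) + X @ self.M @ self.N

# Example: wrap a 4096 -> 4096 projection with a rank-8 adapter.
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 4096 * 8 = 65,536 trainable parameters vs ~16.8M in the base layer
```

Initializing $N$ to zeros means the adapter starts out as a no-op, so training begins from exactly the pre-trained model's behavior.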
Choosing the Dimensions
You might now be asking, "how do we pick the dimensions of and ?" The answer is that there's a trade-off. Let's try to understand where this trade-off comes from.
Let $\Delta W^*$ be the matrix, with rank $r^*$, that optimally learns the task under full fine-tuning conditions. Similarly, let $\Delta W$ be the matrix that we factorize into $M$ and $N$ under LoRA fine-tuning conditions. We give $M$ and $N$ the dimensions $m \times r$ and $r \times n$ respectively, where $r$ is the rank that we choose for our $M$ and $N$ matrices.
From our first and second matrix rank properties, we know that the product $MN$ will have a rank of at most $r$. In other words, $\text{rank}(\Delta W) = \text{rank}(MN) \le r$. Therefore, if $r < r^*$, the rank of $\Delta W$ will be less than the rank of $\Delta W^*$. This means that $M$ and $N$ won't have the size (and therefore parameter count) to sufficiently capture the information that full fine-tuning optimally learns.
Conversely, if $r > r^*$, we will actually have more parameters than necessary. In other words, the optimal fine-tuned update $\Delta W^*$ would possess a rank less than $r$, so we would be using too much representation power in $M$ and $N$.

We therefore face a trade-off. If you care about significantly reducing the number of trainable parameters (i.e. parameter efficiency), you might choose a low $r$ at the expense of compromising the model's ability to learn. If you care about preserving the model's learning capacity, you might choose a high $r$ at the expense of the computational burden of training more parameters.
In practice, we want to aim for $r = r^*$. To get there, you will need to pick a point on the trade-off curve that is relevant to your use case and then iteratively adjust $r$ based on the fine-tuned model's performance.
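To put some rough numbers on the parameter-efficiency side of this trade-off, here's a quick back-of-the-envelope sketch (the $4096 \times 4096$ layer shape is just an assumption, roughly the size of an attention projection in a 7B-class model):

```python
# Trainable parameters for a LoRA adapter on a single m x n weight matrix:
# M has m*r elements and N has r*n elements, versus m*n for full fine-tuning.
m, n = 4096, 4096
full = m * n

for r in (2, 8, 32, 128):
    lora = m * r + r * n
    print(f"r={r:<4} trainable={lora:>10,}  ({100 * lora / full:.2f}% of full fine-tuning)")
```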