LoRA: Reducing Trainable Parameters

Making LLMs More Accessible

Years ago, when LLMs were first receiving mainstream attention, it was tacitly assumed that only the largest companies could ever afford to train and run them. But, as we'll see, with clever techniques like quantization and low-rank adaptation, it is now possible to spin up a $1/hour (or even free) GPU-backed Jupyter notebook and fine-tune massive language models in a matter of hours. Before exploring the mechanics of these two techniques, we must first understand the two main burdens of LLM fine-tuning: compute and memory.

Compute Requirements

The first big burden when it comes to LLM training is the required compute, meaning the computational resources needed to train and run the model. Since LLM forward passes (i.e. running the model) are effectively just billions of operations over the model's floating point parameters, we can quantify compute capability in terms of the number of floating point operations (e.g. additions, multiplications) that we are able to perform per second (abbreviated as FLOPS). More GPUs and more powerful GPUs mean more FLOPS, which means faster model training (and inference).
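As a rough illustration of this relationship, here is a back-of-the-envelope sketch (the operation count and per-GPU throughput below are made-up assumptions, purely to show how FLOPS translate into wall-clock time):

```python
# Back-of-the-envelope: training time = total floating point operations / FLOPS.
# All numbers here are illustrative assumptions, not measurements.

total_flops_needed = 1e21      # hypothetical total operations for a training run
gpu_flops_per_sec = 300e12     # hypothetical sustained throughput of one GPU (300 TFLOPS)
num_gpus = 4                   # more GPUs -> more FLOPS -> less wall-clock time

seconds = total_flops_needed / (gpu_flops_per_sec * num_gpus)
print(f"~{seconds / 3600:.1f} hours of training")  # ~231.5 hours
```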

A corollary to this, however, is that larger models have more floating point parameters and therefore require far more floating point operations per training step, meaning we need more FLOPS to train them in the same amount of time. To understand why, let's take a quick look at how LLMs are fine-tuned under the hood.

Fine-Tuning Under the Hood

As we saw in our deep-dive into fine-tuning, we start by taking a single training example, consisting of an input and output, and using the next-token prediction task to autoregressively (i.e. one token at a time) predict each output token given all the preceding tokens:

Next-Token Prediction
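To make this concrete, here is a minimal sketch of how a next-token prediction loss is typically computed (PyTorch and the toy token ids are my own assumptions; the article doesn't prescribe a framework). Each position's prediction is compared against the token that actually comes next:

```python
import torch
import torch.nn.functional as F

# Toy setup: in practice, `logits` would come from the model's forward pass.
vocab_size = 32_000
token_ids = torch.tensor([[101, 2057, 2293, 23569, 2015, 102]])  # hypothetical token ids
logits = torch.randn(1, token_ids.shape[1], vocab_size)          # stand-in for model output

# Next-token prediction: position i predicts the token at position i + 1,
# so we shift the logits and the labels relative to each other.
shift_logits = logits[:, :-1, :]            # predictions for positions 1..n-1
shift_labels = token_ids[:, 1:]             # the actual "next" tokens

loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),   # (batch * seq, vocab)
    shift_labels.reshape(-1),               # (batch * seq,)
)
print(loss)  # lower is better
```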

We then calculate a loss using our loss (or objective) function that reflects how close the model's predictions are to each actual output token (lower is better), and then use an algorithm called backpropagation to gently nudge the model's parameters in the direction that minimizes this loss. Here's what this parameter update rule looks like mathematically, where $p$ is a model parameter, $L$ is the loss, and $\alpha$ is the learning rate:

$$p_{\text{new}} = p_{\text{old}} - \alpha \cdot \frac{\partial L}{\partial p}$$

Backpropagation provides a closed-form algorithm for calculating the partial derivative of the loss function with respect to each parameter, reflected by the term $\frac{\partial L}{\partial p}$. But that was a mouthful, so let's put it in simpler terms: backpropagation tells us how much each parameter should be updated (and in what direction) to minimize the loss. So, if $\frac{\partial L}{\partial p}$ is a large positive number, this means that we should significantly decrease $p$ to minimize the loss.

In the rest of the equation, we multiply this partial derivative by the learning rate $\alpha$, a small positive term that regulates the size of the updates, and subtract the result from the old parameter value. With our big positive partial derivative, this will result in a large negative update to the parameter, leading to a decrease in the loss on the next training step. The repetition of this process over many training steps is called gradient descent, since we're descending the loss landscape along the gradient of the loss function.

Gradient Descent
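Here is a minimal sketch of one gradient descent step on a single parameter (the toy loss function, learning rate, and use of PyTorch are all my own assumptions, chosen so the arithmetic is easy to verify by hand):

```python
import torch

# A single trainable parameter and a toy loss L(p) = (p - 3)^2,
# which is minimized at p = 3.
p = torch.tensor(5.0, requires_grad=True)
learning_rate = 0.1

loss = (p - 3) ** 2
loss.backward()              # backpropagation: computes dL/dp and stores it in p.grad

print(p.grad)                # dL/dp = 2 * (5 - 3) = 4.0 (large and positive)

with torch.no_grad():
    p -= learning_rate * p.grad   # p_new = p_old - alpha * dL/dp = 5.0 - 0.1 * 4.0 = 4.6

print(p)                     # 4.6 -- nudged toward the loss-minimizing value
```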

In practice, we typically combine multiple training examples into a single parameter update step (a "batch") to smooth out the noise in the updates. In other words, we combine the loss from $B$ training examples into a single loss, where $B$ is called the batch size, and then calculate the partial derivative of this combined loss with respect to each parameter. This approach is called mini-batch gradient descent, as opposed to stochastic gradient descent (what we saw above), which uses a batch size of 1.
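A minimal sketch of one mini-batch update step (again using PyTorch, with a toy linear model and random data standing in for an LLM and a real fine-tuning batch):

```python
import torch
import torch.nn as nn

# Toy "model" and data -- stand-ins for an LLM and a fine-tuning batch.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

B = 16                                   # batch size
inputs = torch.randn(B, 16)              # B training examples
targets = torch.randn(B, 1)

# Combine the loss from all B examples into a single scalar loss...
loss = nn.functional.mse_loss(model(inputs), targets)

# ...then compute dL/dp for every parameter and apply one update step.
optimizer.zero_grad()
loss.backward()
optimizer.step()
```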

The Catch

What we just described is called full fine-tuning, since every single parameter in the model gets updated at each training step, just as in pre-training. Doing so is prohibitively expensive for even the smallest LLMs (e.g. Llama 7B), let alone the largest ones with 100B or more parameters.

Say for example that we'd like to perform full fine-tuning on a 7B-parameter model, on a fine-tuning dataset with 10k examples, a batch size of 16, and 3 epochs (i.e. 3 training passes through the dataset). Each training step would use 16 examples, so there would be 10,000 / 16 = 625 steps per epoch, and 625 * 3 = 1,875 steps in total. This means that each of the 7B parameters would be updated 1,875 times, requiring a total of 13.125T parameter updates. This is a lot of compute! Fine-tuning the entire model would take ages on consumer hardware and cost a fortune on the cloud.
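Spelling out that arithmetic as a quick calculation (all numbers taken directly from the scenario above):

```python
# Full fine-tuning cost, back-of-the-envelope (numbers from the example above).
num_params = 7e9          # 7B-parameter model
num_examples = 10_000     # fine-tuning dataset size
batch_size = 16
epochs = 3

steps_per_epoch = num_examples / batch_size          # 625
total_steps = steps_per_epoch * epochs               # 1,875
total_param_updates = num_params * total_steps       # 1.3125e13 = 13.125T

print(f"{total_steps:.0f} steps, {total_param_updates:.3e} parameter updates")
```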

Memory Requirements

Arguably an even bigger burden when it comes to LLM training is the required memory. To understand why this is the case, let's take a look at how hardware for a training setup is typically structured.

Memory Bandwidths

GPUs are the workhorses of deep learning since they're capable of performing thousands of floating point operations in parallel, far more than a CPU can. Beyond their many ALUs (arithmetic logic units), which are responsible for parallelizing these operations, GPUs also have a large amount of dedicated memory (called VRAM, or video RAM).

GPU Memory

This illustration demonstrates the relative data bandwidths between CPU memory, the CPU, and both components of the GPU. The key takeaway is that data bandwidth between VRAM and the GPU chip (where the computation happens) is significantly greater than the data bandwidth between the CPU and the GPU.

The Catch

What does this mean for fine-tuning? Well, since training involves constantly shuttling the model's parameters back and forth between memory and the GPU cores, memory bandwidth is a critical bottleneck. As a result, we must store all of the model's parameters (and intermediate values like activations and gradients) in VRAM! Keeping any of these in CPU memory would be far too slow.

The first issue with this is that GPU memory is expensive, far more expensive than CPU memory. Consumer-grade GPUs top out at around 24GB of VRAM (e.g. the RTX 4090), while professional GPUs used in data centers (such as the A100) can have up to 80GB of VRAM (and cost around $20k per GPU). Compare this to CPU memory, where it is common to see consumer builds with 64-128GB and cloud instances with 1TB+.

GPU VRAM
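If you want to check how much VRAM your own GPU has (a practical aside, not something the original walks through), PyTorch exposes it directly:

```python
import torch

# Report the name and total VRAM of each visible GPU.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB of VRAM")
else:
    print("No CUDA-capable GPU found")
```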

The second issue is that LLMs are massive — let's take Llama 7B for example. With 7B parameters, each of which is a 32-bit floating point number, the model's parameters alone require 7B * 32 / 8 = 28GB of VRAM. This is already more than the VRAM of most consumer-grade GPUs, and we haven't even considered the memory required for the model's activations and gradients, which are also stored in VRAM during training.
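The parameter-memory arithmetic from above, written out (the fp16 line is an extra illustration of why lower precision, i.e. the quantization idea mentioned in the intro, helps; it is not part of the original example):

```python
# Memory needed just to hold the parameters, ignoring activations and gradients.
num_params = 7e9                   # Llama 7B

bytes_fp32 = num_params * 32 / 8   # 4 bytes per parameter
bytes_fp16 = num_params * 16 / 8   # 2 bytes per parameter (half precision)

print(f"fp32: {bytes_fp32 / 1e9:.0f} GB")   # 28 GB -- more than a 24GB RTX 4090
print(f"fp16: {bytes_fp16 / 1e9:.0f} GB")   # 14 GB
```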

When you move up to even larger models like Llama 70B, it becomes impossible to fit the entire model into a single GPU's VRAM, requiring techniques like model parallelism to distribute the model across multiple GPUs. Suddenly, the only way to train these models is to rent something like 4x A100s from a cloud provider, which won't be cheap.