Quantization: Shrinking the Size of LLMs
theory
quantization
3
Introduction
In the previous lesson, we covered the key limitations of fine-tuning LLMs, namely the compute and GPU memory it requires. We then dove (very) deep into low-rank adaptation (LoRA), a technique that significantly reduces the number of trainable parameters when fine-tuning an LLM.
Before we can put everything we've learned together to assemble our own fine-tuning pipeline, we need to cover one more technique: quantization. Let's get started!