Quantization: Shrinking the Size of LLMs

Your First Quantization

Enough theory — let's learn how quantization is actually performed! In the previous lesson, we learned about using the Hugging Face transformers library to load PyTorch implementations and weights for virtually any open-source LLM. We also learned about Hugging Face's peft library to attach LoRA to the PyTorch model.

In this section, we're going to use a similar approach with the bitsandbytes library, which provides a simple API to quantize PyTorch models. As we'll see, bitsandbytes is simply one specific implementation of the quantization strategies that we learned about earlier, and there exist many other implementations and formats that we'll take a look at.

Intro to bitsandbytes

bitsandbytes is a popular quantization library that is designed to generate quantized versions of existing PyTorch models. It implements an algorithm for dynamic quantization based on the LLM.int8() paper, in which most of the model is quantized to int8 but critical outlier values remain in the higher-precision float16. The Hugging Face blog has a great post on how this works under the hood.
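
To make this concrete, here is a heavily simplified sketch of the LLM.int8() idea (not the actual bitsandbytes kernels): activation columns whose magnitude exceeds a threshold are handled in float16, while everything else goes through int8 with absmax scaling. The function names and the 6.0 threshold are illustrative choices, not library APIs.

Python
import torch

def absmax_quantize_int8(w: torch.Tensor):
    # One scale per output row: map the largest magnitude to 127.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def llm_int8_linear(x: torch.Tensor, w: torch.Tensor, threshold: float = 6.0):
    # Columns of x with any value above the threshold are "outlier" features.
    outliers = (x.abs() > threshold).any(dim=0)

    # Outlier features stay in float16...
    y_fp16 = x[:, outliers].half() @ w[:, outliers].half().T

    # ...while the rest go through int8 weights (dequantized here for clarity).
    q, scale = absmax_quantize_int8(w[:, ~outliers])
    y_int8 = x[:, ~outliers] @ (q.float() * scale).T

    return y_fp16.float() + y_int8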

Another key feature of bitsandbytes is that it supports k-bit quantization, meaning it can perform dynamic quantization for several different bit widths (although 8-bit and 4-bit are the most common). On top of this, it uses block-wise quantization as we learned about earlier, which means that the quantized models are highly accurate but will use a bit of additional memory due to the extra quantization parameters.
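
And here is a minimal sketch of block-wise absmax quantization, assuming a block size of 64 (bitsandbytes uses its own block sizes and data layouts). Each block gets its own scale, which is why block-wise quantization is more accurate but carries a small memory overhead for the extra quantization parameters; blockwise_absmax_int8 and the block size are illustrative choices.

Python
import torch
import torch.nn.functional as F

def blockwise_absmax_int8(tensor: torch.Tensor, block_size: int = 64):
    # Flatten and pad the tensor so it splits evenly into fixed-size blocks.
    flat = tensor.flatten()
    flat = F.pad(flat, (0, (-flat.numel()) % block_size))
    blocks = flat.view(-1, block_size)

    # One absmax scale per block: an extreme value only distorts its own block.
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((blocks / scales).round(), -127, 127).to(torch.int8)
    return q, scales

w = torch.randn(512, 512)
q, scales = blockwise_absmax_int8(w)
print(f"int8 payload: {q.numel()} bytes, scale overhead: {scales.numel() * scales.element_size()} bytes")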

Before moving on, let's make sure that all the necessary Python packages we'll be using are correctly installed. Set up a virtual environment and install the following packages:

Bash
pip install transformers torch bitsandbytes accelerate optimum

Loading the Model

Let's start by loading the model and tokenizer for Meta AI's Open Pre-Trained Transformer (OPT) using the transformers library. Specifically, we will use the version with 350M parameters called OPT-350M. This is a fairly small LLM, especially when compared to other "small" models like Llama 2 7B.

🔎 These examples are meant to be run on GPUs. If you don't have access to your own GPU (which is very common so don't worry), there are free services like Google Colab and Kaggle Notebooks that offer notebook environments with free GPUs. I recommend you head over to Colab and spin up a free notebook with a T4 GPU. You can also pay a small amount for services like Paperspace to access more powerful GPUs.

As we saw in the last lesson, the AutoModelForCausalLM and AutoTokenizer classes will automatically choose the correct PyTorch implementation for our model choice and download the weights from the Hugging Face model hub. In this case, AutoModelForCausalLM will resolve to an instance of OPTForCausalLM, which is a task-specific subclass of OPTModel for causal language modeling.

Python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
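
If you're curious, you can confirm which concrete class AutoModelForCausalLM resolved to:

Python
print(type(model).__name__)  # OPTForCausalLM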

Let's then define a utility function that will calculate a model's memory footprint and print it out in GB. This will be useful for comparing the memory usage of the float32 and various different quantized models.

To do this, we will iterate over every parameter tensor in the model using the built-in model.parameters() method and calculate its size by multiplying the number of elements by the size of each element in bytes (4 for float32, 2 for float16, and so on). Summing these gives total_bytes, the model's total memory consumption in bytes, which we then divide by 1024^3 to convert to GB and round to three decimal places.

Python
def calculate_model_size(model) -> float:
    total_bytes = sum(
        p.numel() * p.element_size()
        for p in model.parameters()
    )

    return round(total_bytes / 1024 ** 3, 3)

print(f'float32 model: {calculate_model_size(model)} GB')
Output
float32 model: 1.234 GB

When we call calculate_model_size on the original float32 version of OPT-350M, we find that the model consumes around 1.234GB of memory. In practice, this is more than small enough to fit within the VRAM of most GPUs (even commodity GPUs have between 8 and 24GB of VRAM), but let's see how much we can further reduce this by quantizing the model.

Quantizing the Model

Now, let's quantize the model to float16 using the built-in half() function in PyTorch (recall that this isn't really quantization but it's a good starting point). Since we're using half-precision floating points, we'd expect the memory consumption to be halved as well. Let's run the calculate_model_size function on the float16 model:

Python
model = model.half()
print(f'float16 model: {calculate_model_size(model)} GB')
Output
float16 model: 0.617 GB

Sure enough, we get exactly half the memory consumption at 0.617GB. This is a great result, but we can go even further by quantizing the model to int8.

One of the big benefits of bitsandbytes as an option for quantizing LLMs is that it's tightly integrated with Hugging Face transformers. If you have the bitsandbytes package installed, transformers will automatically expose a convenient config class called BitsAndBytesConfig. When you pass this config to the same from_pretrained method that we used earlier, Hugging Face will simultaneously load and quantize the model using the bitsandbytes dynamic quantization algorithm.

The absolute simplest way to use this class is to set the argument load_in_8bit to True, but there are many other arguments that you can play around with. When we pass our quantization config into from_pretrained, the model weights will be loaded (they should be cached locally from last time) and immediately quantized to int8. Note that this process occurs entirely on the fly without any calibration necessary since bitsandbytes uses a dynamic quantization method. Now, if you print out the model with print(model), you'll notice that the Linear layers have all been replaced with Linear8bitLt layers from the bitsandbytes library.

Python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
	"facebook/opt-350m",
	quantization_config=quantization_config
)

print(f'int8 model: {calculate_model_size(model)} GB')
Output
int8 model: 0.335 GB
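
You can verify the layer swap for yourself by inspecting one of the model's linear layers (the module path below assumes OPT's decoder layout, with fc1/fc2 feed-forward projections in each decoder layer):

Python
# Print a single decoder layer's feed-forward projection to check its type.
layer = model.model.decoder.layers[0].fc1
print(type(layer).__name__)  # should report Linear8bitLt rather than Linear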

We've roughly halved the memory footprint again; the small gap from a perfect halving comes from the extra quantization constants and the outlier values that remain in float16. We're now down to 335MB from the original 1.234GB! Let's go one step further by quantizing the model to int4 using the load_in_4bit argument in our quantization config:

Python
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
	"facebook/opt-350m",
	quantization_config=quantization_config
)

print(f'int4 model: {calculate_model_size(model)} GB')
Output
int4 model: 0.194 GB
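
The 4-bit path also exposes a few additional knobs on BitsAndBytesConfig. A commonly used combination looks like the sketch below (NF4 quantization, double quantization of the block-wise scales, and float16 compute); treat these values as a starting point rather than a recommendation, and pass the config to from_pretrained exactly as before.

Python
import torch

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4 instead of plain int4
    bnb_4bit_use_double_quant=True,       # also quantize the block-wise scales
    bnb_4bit_compute_dtype=torch.float16  # dtype used for the actual matmuls
)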

There you have it: just under 200MB for our int4 quantized model, a roughly 85% reduction in memory consumption from the original float32 model. The best part is that running inference on these quantized models works exactly the same way as it does for float32 models.

Python
"""
Here, we take a prompt and tokenize it using the model's tokenizer. This will translate the prompt into discrete tokens from the model's vocabulary. The `to("cuda")` call moves the PyTorch tensors with the tokenized inputs to GPU memory where the model is stored.
"""
text = "The meaning of life is"
inputs = tokenizer(text, return_tensors="pt").to("cuda")

"""
We then run `model.generate` with our quantized model and the tokenized inputs. Since the model will produce a sequence of tokens up to a maximum length of 256, we can decode the output tokens back into human-readable text using the tokenizer's `decode` method.
"""
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output
The meaning of life is to be alive.

Note that models quantized with bitsandbytes are only meant to run on the GPU. As a result, when from_pretrained detects a bitsandbytes quantization config on the model, it will immediately load it into GPU memory instead of the default CPU memory. There's no need to call to("cuda") or cuda() on the model to manually move it to the GPU. Also, if you try to move the model to CPU memory, you'll get an error. Later, we'll take a look at quantization formats that can run on both CPUs and GPUs.
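
As a quick sanity check (the exact exception and message depend on your transformers version), you can confirm where the model lives and what happens if you try to move it:

Python
print(model.device)  # the quantized model is already on a CUDA device, e.g. cuda:0

try:
    model.to("cpu")  # not supported for bitsandbytes-quantized models
except Exception as e:
    print(f"Moving to CPU failed: {e}")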

Saving Quantized Models

The final step is to save our quantized model. This is a hugely important step as it allows us (and others) to load a pre-quantized version of the model in the future, saving us the time and resources needed to download the unquantized weights and quantize them on the fly. We'll touch heavily on this general idea in the next section.

Saving to Disk

As we've seen, Hugging Face's AutoModelForCausalLM.from_pretrained() method returns a PyTorch implementation of the requested model. To make our lives easier, Hugging Face includes several convenience methods on these PyTorch models. One such function that we've used extensively is model.generate(), which in our case performs the causal language modeling task of iteratively generating the next token until max_new_tokens or the EOS (end of sequence) token is reached.

Another convenient function is model.save_pretrained(), which allows us to save the quantized model to disk. Using it is as simple as passing in a target directory path:

Python
model.save_pretrained("opt-350m-int8-bitsandbytes")

This will create a directory called opt-350m-int8-bitsandbytes with three files: a pytorch_model.bin file containing the pickled model parameters, a config.json with the initialization config (including the bitsandbytes quantization config in our case), and a generation_config.json with information on the model's generation settings (e.g. special tokens).

One issue with this approach is that the serialization library that it uses, called pickle, is fundamentally unsafe as it lets you run arbitrary code when deserializing. There are many other formats like h5 and SavedModel that have their own downsides as well. To answer the call for a safer serialization format, Hugging Face has recently developed a new library called safetensors that is designed to be safe, efficient, and flexible. This new library is quickly becoming the default across the Hugging Face ecosystem.

To use safetensors instead of pickle, simply call the save_pretrained method like we did above but with the safe_serialization argument set to True:

Python
model.save_pretrained("opt-350m-int8-bitsandbytes", safe_serialization=True)

This will generate a model.safetensors file instead of the .bin file. Then, loading the local model files is as simple as loading models from the Hugging Face hub! The from_pretrained method will automatically detect the serialization format and the fact that the model lives on disk rather than in the Hugging Face hub:

Python
model = AutoModelForCausalLM.from_pretrained("opt-350m-int8-bitsandbytes")
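
If you want to peek inside the saved file, the safetensors library can open it directly and list the stored tensors. This is a quick sketch; the path assumes the default model.safetensors file produced by save_pretrained.

Python
from safetensors import safe_open

with safe_open("opt-350m-int8-bitsandbytes/model.safetensors", framework="pt") as f:
    # Print the first few tensor names along with their dtypes and shapes.
    for name in list(f.keys())[:5]:
        tensor = f.get_tensor(name)
        print(name, tensor.dtype, tuple(tensor.shape))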

Saving to Hugging Face

There's one more cool thing you can do here: upload your quantized model to the Hugging Face model hub! First, sign up for a Hugging Face account, then authenticate with the Hugging Face CLI if developing locally or with notebook_login if using a notebook environment, as shown below.
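
For example, in a notebook environment like Colab:

Python
# In a notebook, this opens an interactive prompt for your Hugging Face token:
from huggingface_hub import notebook_login
notebook_login()

# On a local machine, run this in your shell instead:
#   huggingface-cli login

Once authenticated, you can use the push_to_hub method to upload your model to the hub: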

Python
model.push_to_hub(
	repo_id='alexlangshur/opt-350m-int8-bitsandbytes',
	revision='main',
	private=True,
	safe_serialization=True
)

This will create a new private model repository in your Hugging Face account and upload the quantized model files to it under the main revision. Revisions in Hugging Face are analogous to branches in Git repositories and will default to main if not specified. The safe_serialization argument will ensure that the model is serialized using safetensors rather than pickle.

When the upload is complete, the repo will include a model.safetensors file with the quantized weights, a config.json file with the model's configuration (and the quantization config!), and a generation_config.json file with the model's generation settings (for running model.generate). If other users were to download and use this model, transformers would use the config.json to detect that the model is quantized and automatically load it as such with the bitsandbytes backend, avoiding the need to quantize the model on the fly as we did earlier.

🔎 When naming quantized models on the hub, it's convention to append the model format and often the bit width to the end of the model name, as we've done here with "opt-350m-int8-bitsandbytes".

To make this model repo even easier to use, we can also upload the tokenizer! This will allow others to effortlessly download the model and tokenizer from the same location, which is a big win for usability. To do this, simply call the push_to_hub method on the tokenizer and pass in the same repo ID and revision:

Python
tokenizer.push_to_hub(
	repo_id='alexlangshur/opt-350m-int8-bitsandbytes',
	revision='main',
)

The repo will now receive a few additional files containing the tokenizer's vocabulary and configuration (for OPT's byte-level BPE tokenizer, these include vocab.json and merges.txt). You can now load both the pre-quantized model and its tokenizer from the hub with the from_pretrained method:

Python
model = AutoModelForCausalLM.from_pretrained("alexlangshur/opt-350m-int8-bitsandbytes")
tokenizer = AutoTokenizer.from_pretrained("alexlangshur/opt-350m-int8-bitsandbytes")

The finishing touch is to add a model card to the repo by uploading a README.md file. The model card is the first thing users see when they visit the repo and typically contains a description of the model, instructions for how to use it, and any other relevant information.
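
If you prefer to do this programmatically, the huggingface_hub client can upload the file for you. This is a sketch that assumes you've already written a README.md locally.

Python
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="README.md",   # the model card you've written locally
    path_in_repo="README.md",
    repo_id="alexlangshur/opt-350m-int8-bitsandbytes",
)

Then, head to the repository to check out the final product!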
