Assembling BERT
Now that we understand how BERT works and why it's such a big deal, let's learn how to implement it! To do so, we'll use the Hugging Face transformers
library, which provides a convenient Python interface for downloading and running transformer models that are publicly hosted on Hugging Face.
The library provides a diverse selection of ready-to-use models such as BERT that you can download locally and use for a wide range of NLP tasks. It is compatible with PyTorch and TensorFlow, the two leading deep learning frameworks, enabling easy integration of models. For our purposes, we'll use BERT's implementation in PyTorch.
Getting Started
To get started, in your Python virtual environment, you'll need to install the transformers
library, along with PyTorch. We'll also use NumPy, a popular numerical computing library, for storing and interacting with our embedding vectors.
pip install transformers torch numpy
The first and simplest step is to import and build our BERT tokenizer and model. The transformers library provides us with PyTorch implementations of the two main components that we'll use for generating BERT text embeddings: BertTokenizer and BertModel.
A transformer model's tokenizer is responsible for converting text into numerical tokens that the model can understand. To access BERT's tokenizer, we'll use the BertTokenizer
class. The BertModel
class is then the actual BERT model implementation that we'll use to generate embeddings. Both these classes inherit the from_pretrained
method, which allows us to initialize them with a specific pre-trained version of BERT hosted on Hugging Face.
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = BertModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
In our case, we'll leverage a specialized version of BERT known as Sentence-BERT (SBERT), specifically the all-MiniLM-L6-v2 version. This variant is an enhancement of the original BERT, tailored for more efficient sentence-level embedding generation. It is based on the MiniLM architecture, which is a compact yet powerful model that generates 384-dimensional dense vectors. Due to its reduced size, with only 6 layers compared to standard BERT's 12 layers, it is both faster and more resource-efficient for embedding tasks.
This step may take a few minutes to complete since you will be downloading the model's entire vocabulary and all 22M parameters to your local machine!
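If you'd like to verify these numbers on your own machine, a quick sanity check (a minimal sketch, assuming the tokenizer and model above loaded without errors) is to inspect the model's configuration and count its parameters:
# Inspect the downloaded model's configuration and size.
num_parameters = sum(p.numel() for p in model.parameters())
print(f"embedding dimension: {model.config.hidden_size}")       # expected: 384
print(f"transformer layers: {model.config.num_hidden_layers}")  # expected: 6
print(f"total parameters: {num_parameters:,}")                  # roughly 22 million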
BERT Tokenizer
Tokenization is the process of breaking down a text sequence into individual tokens, which allows us to standardize text for downstream processing. A sentence tokenizer (such as NLTK's sent_tokenize) breaks a text sequence into sentences, while a word tokenizer (such as NLTK's word_tokenize) breaks a sentence into words. BERT's tokenizer, on the other hand, is a subword tokenizer, which breaks down words into smaller subword units called wordpieces. This is a crucial distinction, as it allows BERT to handle out-of-vocabulary (OOV) words, which are words that are not present in the model's vocabulary. BERT's tokenizer can also be case-sensitive, depending on whether the pre-trained variant is cased or uncased; a cased tokenizer distinguishes between uppercase and lowercase letters.
BERT's tokenizer works in two steps. First, it converts unstructured text to a list of wordpieces. For example, the word "university" might be broken down into un, ##iver, and ##sity. Next, it maps each wordpiece to its corresponding ID in the BERT vocabulary so that it may be understood by the model. For instance, un might be mapped to 123, ##iver to 456, and ##sity to 789.
The BertTokenizer class provides two convenient methods for performing this tokenization process: tokenize and convert_tokens_to_ids. We'll start by wrapping our text document with the special [CLS] and [SEP] tokens, which we learned about earlier, and then tokenize it using the tokenize method. This returns a list of wordpieces, which we can then convert to their corresponding IDs using the convert_tokens_to_ids method.
document = "Hello, world!"
marked_document = "[CLS] " + document + " [SEP]"
tokens = tokenizer.tokenize(marked_document)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f'tokens: {tokens}')
print(f'token_ids: {token_ids}')
tokens: ['[CLS]', 'Hello', ',', 'world', '!', '[SEP]']
token_ids: [101, 8667, 117, 1362, 106, 102]
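Every word in this example happens to exist whole in the vocabulary, so each one maps to a single token. To see the subword fallback described earlier, try tokenizing a rarer word (a minimal sketch; the exact wordpieces you get back depend on the model's vocabulary, so the word here is purely illustrative):
# A long, uncommon word is unlikely to appear in the vocabulary as a single token,
# so the tokenizer splits it into wordpieces (continuation pieces start with ##).
print(tokenizer.tokenize("electroencephalography"))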
Tokenizing Multiple Documents
We'd like to be able to use the BERT model to generate embeddings for multiple documents at once. The reason for this is that deep learning models are typically optimized for batch processing. It will be critically important to reap these performance benefits so that we can efficiently generate BERT embeddings for a large corpus of documents.
We call the number of documents in a batch the batch size. The model expects a single 2-dimensional tensor (i.e. a matrix) as input with shape (B, L), where B is the batch size and L is the number of tokens in each document in the batch. This raises an important issue: not every document in the batch will have the same number of tokens (i.e. the same value of L).
To solve this problem, we'll need to pad the shorter documents in the batch and set L to the length of the longest document. Thankfully, the BertTokenizer
class has a shortcut that performs the entire tokenization pipeline we wrote out above while abstracting away the complexity of padding:
documents = [
    "Hello, how are you?",
    "I am fine.",
    "Thanks for asking!"
]
model_inputs = tokenizer(
    documents,
    padding=True,
    truncation=True,
    return_tensors="pt"
)
By calling the tokenizer instance directly, we can provide a list of documents and it will generate all the necessary tokenized inputs for the model. The padding
and truncation
arguments tell the tokenizer to pad the shorter documents and truncate the longer ones, respectively (BERT has a maximum input size of 512 tokens). The return_tensors
argument tells the tokenizer to return the inputs as PyTorch tensors.
The tokenizer outputs a dictionary containing three tensors: input_ids, token_type_ids, and attention_mask. The input_ids key contains the token IDs for each document in the batch. The attention_mask key marks which positions hold real tokens and which hold padding, so the model knows to ignore the padded positions, while the token_type_ids key distinguishes between segments of the input (all zeros here, since each document is a single segment). Each of these tensors has the shape (B, L); in this case, B is 3 (the number of documents) and L is 8 (the number of tokens in the longest tokenized document).
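To make these shapes concrete, you can inspect the tensors directly (a small sketch; the exact value of L depends on how the tokenizer splits your documents):
print(model_inputs["input_ids"].shape)  # (B, L), e.g. torch.Size([3, 8])
print(model_inputs["attention_mask"])   # 1 for real tokens, 0 for padding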
BERT Model
Now that we have the inputs for the model, we can generate embeddings! The BertModel
class expects as input the 2-dimensional tensor of token IDs with shape (B, L). We'll then extract from its output a 2-dimensional tensor of document embeddings with shape (B, E), where E is the embedding dimension (384 for the BERT model we're using).
with torch.no_grad():
    model_outputs = model(**model_inputs)

embeddings = model_outputs[0]              # shape is (B, L, E)
document_embeddings = embeddings[:, 0, :]  # shape is (B, E)
The torch.no_grad
context manager tells PyTorch to skip gradient tracking during our forward pass of the model. If you're familiar with the basics of machine learning, you'll know that training involves a process called backpropagation that uses the gradients of each parameter to iteratively update the model. Since we're not training the model, we can safely skip this bookkeeping for better performance and lower memory use. We're effectively putting the model in "inference mode".
Then we can run the model! We pass in and unpack the model_inputs
dictionary that we generated earlier, which contains the token IDs for each document in the batch (and the other tensors used for padding and attention). The model outputs a tuple whose first element contains the embeddings for each token in the input. The shape of this tensor is (B, L, E), where B is the batch size, L is the number of tokens in each document, and E is the embedding dimension.
Since we're interested in document embeddings, we need to extract the first token embedding corresponding to the special [CLS]
token for each document in the batch. This is the first element of the second dimension of the tensor, which we can extract using the [:, 0, :]
indexing syntax. The shape of the resulting 2-dimensional tensor is (B, E), where B is the batch size and E is the embedding dimension.
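As a quick check, printing the tensor shapes at each stage should match the description above (a small sketch, assuming the batch of three documents from earlier):
print(embeddings.shape)           # (B, L, E), e.g. torch.Size([3, 8, 384])
print(document_embeddings.shape)  # (B, E),    e.g. torch.Size([3, 384])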
End-to-End BERT
That's it — we are now able to run our BERT model on multiple documents and extract the outputs. Let's put it all together in a single end-to-end function and test it out on some documents. This function will take a list of documents as input and return the 2-dimensional tensor of document embeddings with shape (B, E).
def generate_bert_embeddings(documents: list[str], tokenizer: BertTokenizer, model: BertModel) -> torch.Tensor:
    # tokenize documents
    model_inputs = tokenizer(
        documents,
        padding=True,
        truncation=True,
        return_tensors="pt"
    )
    # run model
    with torch.no_grad():
        model_outputs = model(**model_inputs)
    # extract document embeddings
    embeddings = model_outputs[0]
    document_embeddings = embeddings[:, 0, :]
    return document_embeddings
documents = ["Hello, how are you?", "I am fine.", "Thanks for asking!"]
embeddings = generate_bert_embeddings(documents, tokenizer, model)
print(embeddings)
tensor([[-0.0406, -0.0248, 0.1411, ..., -0.1443, -0.2807, -0.1596],
[ 0.0288, -0.0046, 0.0673, ..., 0.2228, -0.2635, -0.2127],
[-0.0253, 0.1019, -0.1553, ..., -0.2180, -0.1549, -0.2683]])
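Since we installed NumPy for working with these embedding vectors, here is one way you might use the output. The following is a minimal sketch that converts the document embeddings to a NumPy array and compares the documents by cosine similarity; a higher value means two documents are more semantically similar:
import numpy as np

# Convert the (B, E) tensor of document embeddings to a NumPy array.
vectors = embeddings.numpy()

# Normalize each row to unit length, then compute pairwise cosine similarities.
normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity_matrix = normalized @ normalized.T  # shape (B, B)

print(similarity_matrix)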