Generating Contextual Embeddings with BERT

Introduction to BERT

In our introduction to text embeddings, we explored foundational models like word2vec and doc2vec. These models have been pivotal in understanding text-based data, but the evolution of AI demands more nuanced and sophisticated tools. Enter BERT (Bidirectional Encoder Representations from Transformers), a revolutionary step in the world of natural language processing.

BERT, developed by Google AI Language researchers in 2018, emerged as a paradigm shift, leveraging transformer models to understand the context of a word in a sentence much more effectively than ever before. Unlike traditional text embedding methods, BERT captures the essence of bidirectional context, meaning it understands the meaning of a word based on both the words that precede and follow it.

Contextual vs Context-Free Embeddings

To appreciate the leap that BERT represents, let's first remind ourselves of the difference between context-free and contextual embeddings:

  • Context-free embeddings, like those generated by word2vec, represent words in isolation. Each word is assigned a fixed vector, regardless of its contextual usage. For instance, the word "bank" would have the same representation in both "river bank" and "money bank." In the previous lesson, we saw how to generate context-free embeddings with doc2vec, which extends word2vec from single words to whole documents.
  • Contextual embeddings, on the other hand, are dynamic. They consider the context in which a word appears, allowing the same word to have different representations based on its surrounding text. This approach is more aligned with how human language operates, and it is exactly what BERT provides (we'll make the difference concrete in the sketch just below).
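
Here is a minimal sketch of that difference, assuming the Hugging Face transformers and torch packages and the public bert-base-uncased checkpoint (the model this lesson focuses on; we'll introduce it properly below). The static lookup uses BERT's own input embedding table as a stand-in for a word2vec-style fixed vector, while the contextual vector comes from the model's final hidden states:

```python
# A minimal sketch contrasting a static, context-free lookup with BERT's
# contextual output for the word "bank" in two different sentences.
# Assumes the Hugging Face `transformers` and `torch` packages and the
# public `bert-base-uncased` checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vectors(sentence):
    """Return (static, contextual) vectors for the token 'bank'."""
    inputs = tokenizer(sentence, return_tensors="pt")
    position = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids("bank"))
    with torch.no_grad():
        # Static lookup: the input embedding table, no context involved.
        static = model.embeddings.word_embeddings(inputs["input_ids"])[0, position]
        # Contextual: the final hidden state at the same token position.
        contextual = model(**inputs).last_hidden_state[0, position]
    return static, contextual

s1, c1 = bank_vectors("He sat down on the river bank.")
s2, c2 = bank_vectors("She deposited the check at the bank.")

cosine = torch.nn.functional.cosine_similarity
print("static similarity:    ", cosine(s1, s2, dim=0).item())  # 1.0 -- same fixed vector
print("contextual similarity:", cosine(c1, c2, dim=0).item())  # < 1.0 -- context matters
```

The static vectors are identical by construction, while the contextual vectors differ because BERT has read the surrounding words. That gap is the whole point of contextual embeddings.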

Introduction to Transformers

Recurrent Neural Networks

Before 2017, the landscape of natural language processing was quite different. The field predominantly relied on models of the Recurrent Neural Network (RNN) family, such as Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs). These models were designed to process sequences of data, such as text, by feeding one word at a time as input and updating a single hidden state at each step.

[Figure: RNN]

This single hidden state could then be used to encode sequences, make classifications, or even generate new tokens (see seq2seq)! Pretty versatile.

While this approach represented a significant step forward in handling sequential data, it had a fundamental limitation: because all context was squeezed into a single hidden state that was updated sequentially as each word was processed, the design constrained the model's ability to understand and remember long-range contextual dependencies.

In simpler terms, recurrent models struggled to remember earlier parts of the text as they moved forward, because every piece of information had to flow through that one hidden state. This significantly hindered their effectiveness on common NLP tasks and their usefulness for generating high-quality contextual text embeddings.
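
To make the bottleneck tangible, here is a minimal, purely illustrative sketch of a vanilla RNN cell in plain numpy (random toy weights, not a trained model). Notice that everything the model "knows" about earlier tokens must fit into the single vector h:

```python
# A toy vanilla RNN cell: at every step the entire history is squeezed
# into one fixed-size hidden state. Weights are random, illustrative values.
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, hidden_dim = 8, 16

# Toy parameters of a vanilla RNN cell.
W_xh = rng.normal(size=(hidden_dim, embedding_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

tokens = ["the", "quick", "brown", "fox", "jumped"]
# Stand-in word vectors; a real model would learn these.
word_vectors = {tok: rng.normal(size=embedding_dim) for tok in tokens}

h = np.zeros(hidden_dim)  # the single hidden state
for tok in tokens:
    # Everything the model "remembers" about earlier tokens lives in h.
    h = np.tanh(W_xh @ word_vectors[tok] + W_hh @ h + b_h)
    print(f"after '{tok}': hidden state norm = {np.linalg.norm(h):.3f}")
```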

The Advent of Transformers

In 2017, everything changed when researchers at Google introduced the transformer architecture in a legendary paper titled "Attention is All You Need". The transformer brought four key innovations that directly addressed the limitations of RNNs:

  1. Parallel Processing: Transformers process all input tokens simultaneously, rather than sequentially. This allows them to maintain and leverage long-range connections within the text, significantly improving their understanding of complex sentence structures.
[Figure: Parallelism]
  2. Encoder-Decoder Architecture: Transformers often use a two-part design, with separate encoder and decoder components. The encoder captures the context of the entire input sequence in parallel, while the decoder uses this context to generate the output sequence. This design is particularly effective in tasks like machine translation, where the model must understand an arbitrary-length input sentence before generating the output.
[Figure: Transformer Architecture]
  3. Attention Mechanism: At the heart of the transformer model is the attention mechanism. This feature allows the model to "pay attention" to specific parts of the input sequence while generating each part of the output. Imagine reading a complex sentence; you often focus on keywords or phrases to grasp the overall meaning. The attention mechanism works similarly: it assigns varying levels of importance to different parts of the sentence, enabling the model to keep track of and utilize relationships between distant words or phrases (see the sketch after this list). This ability to maintain and leverage long-range connections within the text is a significant leap forward, allowing transformers to generate more contextually coherent and accurate language outputs than their predecessors.
[Figure: Self Attention]
  4. Pre-training Language Models: Another pivotal aspect of transformers is their capability to be "pre-trained". Pre-training is the initial, compute-intensive training step in which a transformer model is exposed to massive amounts of text data before being fine-tuned for specific tasks. It relies on self-supervised learning tasks, where the model repeatedly learns to predict missing words in sentences. This extensive pre-training over billions of sentences allows the transformer to learn the nuances of a language, giving us language models (or, at sufficient scale, large language models (LLMs)).
[Figure: Pre-training]
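
The attention mechanism in point 3 boils down to a small amount of linear algebra. Below is a minimal numpy sketch of scaled dot-product attention, the form used inside the transformer; the queries, keys, and values here are random illustrative stand-ins for the learned projections a real model would compute:

```python
# A minimal numpy sketch of scaled dot-product attention. Shapes and
# values are illustrative, not tied to any particular trained model.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how much each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8  # e.g. 5 tokens, 8-dimensional projections
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)  # (5, 8): one context-mixed vector per token
print(weights[0])    # how token 0 distributes its attention across all 5 tokens
```

Each row of the attention weights sums to 1 and tells you how heavily one token draws on every other token when building its new representation.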

If you'd like to go deeper into the transformer architecture, I highly recommend reading Jay Alammar's The Illustrated Transformer. It is a fantastic guided resource that will walk you through the transformer's inner workings in a clear and visual manner.

BERT's Breakthrough

In the years after this breakthrough, roughly 2017 through 2020, transformer models evolved rapidly, with each new iteration pushing the boundaries of what was possible in NLP. These included models like XLNet, OpenAI's GPT, and Google's T5, each of which introduced its own innovations and improvements. One of the most influential early developments, however, was BERT, introduced by Google in 2018.

🔎 BERT was such a big breakthrough that it led to the development of a whole family of models, including RoBERTa, ALBERT, and DistilBERT, that are all based on the same architecture.

BERT, which stands for Bidirectional Encoder Representations from Transformers, applies the power of transformers to the problem of generating contextual text embeddings. The model's success can be attributed to three novel design decisions: its bidirectional context, pre-training tasks, and encoder-only architecture.

Bidirectional Context

Traditional transformer language models are causal, meaning they are trained by repeatedly learning to predict a word conditioned solely on the words that precede it in the sentence. BERT, on the other hand, is trained to predict a word conditioned on both the preceding and the subsequent words. As a result, we say that BERT is deeply bidirectional.

[Figure: Bidirectional]

In the example above, we see the unidirectional training approach of the traditional transformer as it attempts to predict "jumped" from "The quick brown fox". In contrast, BERT's bidirectional training teaches it to predict "jumped" from both the preceding and subsequent words, "The quick brown fox" and "over the lazy dog".

Pre-Training Tasks

BERT uses two novel self-supervised pre-training techniques to enable its bidirectional context: Masked Language Model (MLM) and Next Sentence Prediction (NSP). The MLM technique involves randomly masking words in a sentence, as illustrated in the example above, and learning to predict them based on their context. This forces the model to develop an understanding of the sentence as a whole, as it cannot rely on left-to-right cues alone to fill in the masked word.
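
To see masked language modeling in action, here is a minimal sketch, assuming the Hugging Face transformers package and the public bert-base-uncased checkpoint, that asks BERT to fill in the masked word from our running example:

```python
# A quick look at masked language modeling, assuming the Hugging Face
# `transformers` package and the public `bert-base-uncased` checkpoint.
from transformers import pipeline

# The fill-mask pipeline wraps BERT's masked language modeling head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
predictions = unmasker("The quick brown fox [MASK] over the lazy dog.")

for p in predictions:
    # Each candidate comes with the predicted token and a confidence score.
    print(f"{p['token_str']:>10}  score={p['score']:.3f}")
```

The top candidates typically include words like "jumps" and "jumped", which BERT can only infer by reading both sides of the mask.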

The NSP technique, on the other hand, trains the model to predict the relationship between two consecutive sentences, teaching it to understand the broader context beyond individual sentences. Together, these two pre-training tasks allow BERT to develop a deep understanding of the context in which words appear, making it highly effective at generating contextual text embeddings.

Encoder-Only Architecture

Lastly, BERT uses only the encoder half of the transformer architecture. This is a departure from most transformers, which use the encoder to understand the input and the decoder to generate new output tokens. This design decision is what allows BERT to instead output highly contextualized embeddings of its inputs. We'll learn more about this input-output space below.

[Figure: Encoder]

Inputs and Outputs

🔎 In NLP, a token is an atomic unit of text, often a word, but it can also be a character, subword (word fragment), n-gram (n adjacent words), or a phrase, depending on the level of granularity required. Tokenization is the process of splitting a piece of text into these tokens. This is a critical step in NLP because it transforms unstructured text into a structured form that models can understand and analyze. Also, a model's vocabulary is simply the set of unique tokens that it recognizes.
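
As a quick illustration, here is a minimal sketch of BERT's WordPiece tokenizer at work, assuming the Hugging Face transformers package and the public bert-base-uncased checkpoint:

```python
# A small illustration of BERT's WordPiece tokenization, assuming the
# Hugging Face `transformers` package and the `bert-base-uncased` checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Common words stay whole; rarer words are split into subword pieces
# marked with '##' (e.g. "embeddings" typically becomes something
# like ['em', '##bed', '##ding', '##s']).
print(tokenizer.tokenize("Transformers generate contextual embeddings."))

# The vocabulary is the set of unique tokens the model recognizes
# (roughly 30,000 entries for this checkpoint).
print(tokenizer.vocab_size)
```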

To understand how BERT generates contextual text embeddings, we first need to understand the model's inputs and outputs. For each input token that BERT receives, it will output a corresponding contextualized embedding. These embeddings are vectors of either 384, 768, or 1024 dimensions, depending on the BERT variant being used. Here's what this looks like:

[Figure: BERT]

You might be wondering what the [CLS] and [SEP] tokens are doing in the diagram. Good question: these are both special tokens that BERT uses to structure its input. The role of the [CLS] token is to represent the meaning of the entire input sequence. It comes at the beginning of the input and is often used for downstream sentence classification tasks. BERT's input also features a [SEP] ("separator") token, which is placed at the end of the input sequence and is used to separate adjacent text segments.

Every embedding generated by BERT, including those for the [CLS] and [SEP] tokens, represents the meaning of the corresponding input token contextualized by the surrounding tokens in the input. For instance, the embedding for "riding" (h_riding in the diagram) would encapsulate the word's meaning as it relates to this specific context (i.e. in describing a man on a horse). The [CLS] token's embedding, in turn, captures the meaning of the entire segment, effectively providing a single embedding for the phrase "man is riding a horse"!

The [CLS] token's embedding is precisely how we're able to generate a single text embedding for an entire input sequence, from a short phrase up to a passage spanning many sentences. Paired with the model's breakthrough ability to understand long text sequences, this makes BERT an ideal tool for the task of semantic search.
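
Putting it all together, here is a minimal sketch, assuming the Hugging Face transformers and torch packages and the public bert-base-uncased checkpoint, that feeds the sentence from the diagram through BERT and pulls out both the per-token embeddings and the [CLS] embedding:

```python
# A minimal end-to-end sketch of BERT's inputs and outputs, assuming the
# Hugging Face `transformers` and `torch` packages and the public
# `bert-base-uncased` checkpoint (768-dimensional embeddings).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("A man is riding a horse.", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
# e.g. ['[CLS]', 'a', 'man', 'is', 'riding', 'a', 'horse', '.', '[SEP]']

with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state  # one contextual vector per input token
print(token_embeddings.shape)                 # e.g. torch.Size([1, 9, 768])

# The [CLS] token sits at position 0; its embedding can serve as a single
# vector representing the whole sentence, ready for semantic search.
cls_embedding = token_embeddings[0, 0]
print(cls_embedding.shape)                    # torch.Size([768])
```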