Generating Contextual Embeddings with BERT

As we saw at the beginning of the lesson, our goal is to build out a BERT-based semantic search pipeline that is capable of querying thousands of sentences from English Wikipedia!

Our first step is to create a corpus, for which we will extract the first 10,000 sentences from English Wikipedia, taking each sentence as an individual document. Following this, we'll use BERT to embed each document so that we can then perform semantic search on the corpus.

Building a Wikipedia Corpus

First, let's design our custom Wikipedia corpus. To do so, we'll define a routine called load_wikipedia_corpus that uses the Hugging Face datasets library to load Hugging Face's English Wikipedia dataset in streaming mode and iterate through each article. For each Wikipedia article, we tokenize the text into individual sentences using NLTK's sent_tokenize function. We then iterate through each sentence and add it to our corpus if it meets our minimum and maximum sentence length requirements.

Python
from datasets import load_dataset
from nltk.tokenize import sent_tokenize

def load_wikipedia_corpus(corpus_size: int,
                          min_sentence_length: int=10,
                          max_sentence_length: int=100) -> list[str]:

    # load the Wikipedia dataset hosted on Hugging Face
    wikipedia_articles = load_dataset(
        path='wikipedia',
        name='20220301.en',
        split='train',
        streaming=True
    )

    corpus = []
    for article in wikipedia_articles:

        # tokenize the article into sentences
        sentences = sent_tokenize(article['text'])
        for sentence in sentences:

            # add sentence if it is between min and max length
            word_count = len(sentence.split())
            if min_sentence_length <= word_count <= max_sentence_length:
                corpus.append(sentence)

        # stop after corpus_size sentences
        if len(corpus) >= corpus_size:
            return corpus[:corpus_size]
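
Note that NLTK's sent_tokenize relies on the pre-trained Punkt sentence tokenizer, so you may need to download it once before calling the routine (newer NLTK versions may ask for the 'punkt_tab' resource instead). The quick check below, using an arbitrary corpus size of 100, verifies everything is wired up:

Python
import nltk

# one-time setup: download the Punkt models used by sent_tokenize
nltk.download('punkt')

# small smoke test: load the first 100 qualifying sentences and inspect one
sample_corpus = load_wikipedia_corpus(100)
print(len(sample_corpus), sample_corpus[0])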

Semantic Search Pipeline

With the routine prepared to load our Wikipedia corpus, we can now build out our semantic search pipeline, following the steps we learned about in the previous lesson. The pipeline relies on the generate_bert_embeddings function, along with the BERT tokenizer and model, that we set up previously.
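
In case you need a refresher on that function, the sketch below shows one common way such an embedding routine can be written: tokenize a batch of documents, run them through BERT without tracking gradients, and mean-pool the token embeddings using the attention mask. Treat it as an illustrative sketch rather than the exact implementation we used; in particular, the pooling strategy (CLS token versus mean pooling) depends on how we defined the function previously.

Python
import torch

def generate_bert_embeddings(documents: list[str], tokenizer, model) -> torch.Tensor:
    # tokenize the batch, padding/truncating every document to a shared length
    inputs = tokenizer(documents, padding=True, truncation=True, return_tensors='pt')

    # run BERT without tracking gradients, since we only need the embeddings
    with torch.no_grad():
        outputs = model(**inputs)

    # mean-pool the token embeddings, using the attention mask to ignore padding
    mask = inputs['attention_mask'].unsqueeze(-1)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1)

With that refresher in place, here is the full pipeline: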

Python
import torch
from torch.nn.functional import cosine_similarity

"""
(1) Let's start by loading our corpus with 10,000 sentences from English Wikipedia.
"""
corpus = load_wikipedia_corpus(10_000)

"""
(2) The first step in our semantic search pipeline is to embed the corpus. We'll use our `generate_bert_embeddings` function to generate BERT embeddings in chunks of 100 documents at a time. Transformer models are memory-intensive due to their large number of parameters, and their memory footprint grows with the size of the input batch. Processing the corpus in smaller batches helps prevent out-of-memory errors.

Each batch of embeddings we generate will be a 2-dimensional tensor with shape (100, 384). The `torch.vstack` function allows us to then stack these tensors vertically to create a single 2-dimensional tensor with shape (10000, 384).
"""
batch_size = 100
corpus_embeddings = torch.vstack([
    generate_bert_embeddings(corpus[i:i+batch_size], tokenizer, model)
    for i in range(0, len(corpus), batch_size)
])
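
# optional sanity check: the stacked corpus embeddings should have one row per
# document, i.e. shape (10000, 384) as described above
print(corpus_embeddings.shape)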

"""
(3) The next step in our pipeline is to define and embed our query. Let's write out a query and then embed it with the same routine that we used to embed the corpus. Since the `generate_bert_embeddings` function normally expects a list of documents and we only have a single document here, we'll wrap our query in a list.
"""
query = 'Ayn Rand is a Russian-American writer and philosopher'
query_embedding = generate_bert_embeddings([query], tokenizer, model)

"""
(4) Now we can compute the cosine similarity between the query embedding and each document embedding in the corpus. To do so, we'll use the handy `cosine_similarity` function from PyTorch, which computes the row-wise cosine similarity between two tensors, broadcasting our single query row against every corpus row. The result is a 1-dimensional tensor with shape (10000,) containing the cosine similarity between the query and each of the 10,000 documents in the corpus.
"""
similarities = cosine_similarity(corpus_embeddings, query_embedding)
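
# optional sanity check: cosine similarity is just a dot product between L2-normalized
# vectors, so computing it by hand should match the result above up to floating-point error
manual_similarities = (corpus_embeddings @ query_embedding.T).squeeze(1) / (
    corpus_embeddings.norm(dim=1) * query_embedding.norm(dim=1)
)
assert torch.allclose(similarities, manual_similarities, atol=1e-5)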

"""
(5) Finally, since greater similarity scores indicate a better match, we can sort the similarities in descending order and print the top 5 results. To do so, we use the PyTorch tensor's `argsort` method, which returns the indices of the sorted similarities.
"""
top_k = 5
top_k_indices = similarities.argsort(descending=True)[:top_k]
for index in top_k_indices:
    print(f'Index: {index}')
    print(f'Similarity: {similarities[index]:.3f}')
    print(f'Document: {corpus[index]}\n')
Output
Index: 3418
Similarity: 0.884
Document: The Journal of Ayn Rand Studies, a multidisciplinary, peer-reviewed academic journal devoted to the study of Rand and her ideas, was established in 1999.

Index: 3415
Similarity: 0.884
Document: Yet, throughout literary academia, Ayn Rand is considered a philosopher."

Index: 3410
Similarity: 0.877
Document: The Philosophic Thought of Ayn Rand, a 1984 collection of essays about Objectivism edited by Den Uyl and Rasmussen, was the first academic book about Rand's ideas published after her death.

Index: 3422
Similarity: 0.865
Document: In 2012, the Pennsylvania State University Press agreed to take over publication of The Journal of Ayn Rand Studies, and the University of Pittsburgh Press launched an "Ayn Rand Society Philosophical Studies" series based on the Society's proceedings.

Index: 3412
Similarity: 0.863
Document: In 1987, Allan Gotthelf, George Walsh, and David Kelley co-founded the Ayn Rand Society, a group affiliated with the American Philosophical Association.

Running our semantic search pipeline, we get these 5 documents in descending order of similarity. The model did an impressive job; it returned documents that directly mention Ayn Rand, her literary works, and even her philosophy of Objectivism.

While BERT serves as an excellent foundational model in our exploration of semantic search, demonstrating its utility in retrieving relevant documents, it's important to recognize that this is just the tip of the iceberg. Better models offer not only higher retrieval accuracy but also faster inference, significantly refining our ability to conduct precise and effective semantic searches. We will also delve into far more scalable and performant vector databases, providing the infrastructure to support millions of embeddings and millisecond-latency queries. Stay tuned!

GPU Acceleration

If you have a GPU available, you can leverage it to significantly speed up the embedding generation process. You can also use free GPU resources on a Jupyter Notebook hosted with Google Colab. Once you have initialized the BERT tokenizer and model, you can use the following approach to automatically detect and use the GPU:

Python
# select GPU if available
if torch.cuda.is_available():
    device = torch.device('cuda')

# otherwise use CPU
else:
    device = torch.device('cpu')

# move the BERT model to the selected device
model.to(device)

# set the default device for newly created tensors to the selected device
torch.set_default_device(device)

This code uses CUDA, NVIDIA's GPU computing platform that PyTorch builds on, to detect whether a GPU is available and, if so, moves the model onto it. It also sets the default device for newly created tensors so that we don't have to move them to the GPU manually. Once you run this setup, the existing generate_bert_embeddings function will automatically use the GPU for embedding generation.
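
To confirm the setup worked, you can check where the model's parameters and newly created tensors live; the exact device string (e.g. cuda:0) depends on your hardware:

Python
# the model's parameters should now report the selected device
print(next(model.parameters()).device)   # e.g. cuda:0 when a GPU is available

# newly created tensors also default to the selected device
print(torch.zeros(1).device)             # e.g. cuda:0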