Improving Search with OpenAI's Embedding API

Essay Search with OpenAI

Embedding the Corpus

The first step is to embed each paragraph in our corpus. To do so, we'll use the text-embedding-3-large model, as it offers the best performance. Let's define a function called generate_embeddings that takes in a list of document strings and returns the embeddings as an (N, D) NumPy matrix, where N is the number of documents and D is the embedding dimensionality (3072 in this case).

Before passing the documents to the Embedding API, we will first tokenize them using the cl100k_base tokenizer and truncate them to fit within the 8191-token limit. This is mostly a precautionary measure, since it's unlikely that a single paragraph will contain more than ~8,000 tokens.

Python
import numpy as np
import tiktoken
import openai

client = openai.OpenAI(api_key=API_KEY)
tokenizer = tiktoken.get_encoding("cl100k_base")

def generate_embeddings(documents: list[str]) -> np.ndarray:

	# tokenize and truncate long documents using the tokenizer
	tokenized_documents = []
	for document in documents:
		tokenized_document = tokenizer.encode(document)[:8191]
		tokenized_documents.append(tokenized_document)

	# embed tokenized documents using the Embedding API
	response = client.embeddings.create(
		input=tokenized_documents,
		model="text-embedding-3-large"
	)

	# convert embeddings (a list of lists of floats) to a NumPy matrix
	embeddings = [item.embedding for item in response.data]
	return np.array(embeddings)
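
As a quick sanity check, we can call generate_embeddings on a couple of short strings and confirm that the returned matrix has the expected (N, D) shape. (This snippet is purely illustrative; the sample strings are made up.)

Python
# Illustrative check: two documents in, a (2, 3072) matrix out.
sample_embeddings = generate_embeddings([
    "Startups are counterintuitive.",
    "Write like you talk.",
])
print(sample_embeddings.shape)  # expected: (2, 3072)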

Now, we can embed our corpus of paragraphs using the generate_embeddings function. Similar to our Wikipedia semantic search pipeline with BERT, we will batch our documents into groups of 1,000 to avoid hitting any API limits. Embedding the entire 7,533-paragraph corpus with this model will take about 30 seconds and cost an impressively low ~$0.04.

Python
corpus_texts = [document['text'] for document in corpus]

batch_size = 1000
corpus_embeddings = np.vstack([
    generate_embeddings(corpus_texts[i:i+batch_size])
    for i in range(0, len(corpus_texts), batch_size)
])

Here, we've used the NumPy vstack function (similar to the PyTorch torch.vstack function we saw previously) to stack the embeddings from each batch along their first dimension to create a single NumPy matrix for the whole corpus. In other words, we stack several (B, D) matrices into a single (N, D) matrix, where B is the batch size and N is the corpus size. In actual numbers, the resulting matrix will have a shape of (7533, 3072).
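
If the stacking behavior isn't familiar, here's a tiny illustrative example (with made-up toy shapes, not our real embeddings) showing how vstack combines several (B, D) matrices into one (N, D) matrix:

Python
# Toy example: stacking two (2, 4) batches into a single (4, 4) matrix.
batch_a = np.ones((2, 4))
batch_b = np.zeros((2, 4))
stacked = np.vstack([batch_a, batch_b])
print(stacked.shape)  # (4, 4)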

Querying the Corpus

Now for the fun part: let's try searching the corpus. Having co-founded Y Combinator, Paul Graham often writes about startup advice, so let's search for the most relevant paragraphs that discuss the challenges of picking a good co-founder.

Python
from sklearn.metrics.pairwise import cosine_similarity

"""
We can start by defining our query and embedding it using the same `generate_embeddings` function. In return, we'll get a single (1, D) matrix for our query embedding.
"""
query = "how to pick a good co-founder"
query_embedding = generate_embeddings([query])

"""
Next, we need to measure the similarity between this embedding and the entire corpus of paragraph embeddings. To do so, we'll use the `cosine_similarity` function from scikit-learn. This function takes in two matrices and returns a matrix of pairwise cosine similarities between the rows of the two matrices. In our case, we'll pass in our query embedding and the corpus embeddings to produce a single (1, N) matrix of similarity scores. Recall that cosine similarity scores range from -1 to 1, with higher scores indicating greater similarity. Finally, we remove the first dimension of the resulting matrix to get a single N-element NumPy array of similarity scores.
"""
similarities = cosine_similarity(query_embedding, corpus_embeddings)[0]
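# As an aside (not required for our pipeline): cosine similarity is simply the dot
# product of L2-normalized vectors, so the line above is equivalent to:
#   q = query_embedding / np.linalg.norm(query_embedding, axis=1, keepdims=True)
#   c = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
#   similarities = (q @ c.T)[0]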

"""
Now, let's print out the documents with the top 3 similarity scores! We can sort the indices of the similarity scores using `argsort()`, reverse the order using `[::-1]` to get the highest scores at the front, and then take the top 3 indices using `[:3]`. Finally, we can use these indices to print out the top 3 paragraphs from our corpus.
"""
top_k = 3
top_k_indices = similarities.argsort()[::-1][:top_k]
for index in top_k_indices:
	print(corpus[index])
Output
{
    "essay_title": "A Student's Guide to Startups",
    "essay_url": "http://www.paulgraham.com/mit.html",
    "paragraph_index": 68,
    "text": "Number two, make the most of the great advantage of school: the wealth of co-founders. Look at the people around you and ask yourself which you'd like to work with. When you apply that test, you may find you get surprising results. You may find you'd prefer the quiet guy you've mostly ignored to someone who seems impressive but has an attitude to match. I'm not suggesting you suck up to people you don't really like because you think one day they'll be successful. Exactly the opposite, in fact: you should only start a startup with someone you like, because a startup will put your friendship through a stress test. I'm just saying you should think about who you really admire and hang out with them, instead of whoever circumstances throw you together with."
}

{
    "essay_title": "Billionaires Build",
    "essay_url": "http://www.paulgraham.com/ace.html",
    "paragraph_index": 29,
    "text": "If the partners are sufficiently convinced that there's a path to a big market, the next question is whether you'll be able to find it. That in turn depends on three things: the general qualities of the founders, their specific expertise in this domain, and the relationship between them. How determined are the founders? Are they good at building things? Are they resilient enough to keep going when things go wrong? How strong is their friendship?"
}

{
    "essay_title": "How to Start a Startup",
    "essay_url": "http://www.paulgraham.com/start.html",
    "paragraph_index": 26,
    "text": "Ideally you want between two and four founders. It would be hard to start with just one. One person would find the moral weight of starting a company hard to bear. Even Bill Gates, who seems to be able to bear a good deal of moral weight, had to have a co-founder. But you don't want so many founders that the company starts to look like a group photo. Partly because you don't need a lot of people at first, but mainly because the more founders you have, the worse disagreements you'll have. When there are just two or three founders, you know you have to resolve disputes immediately or perish. If there are seven or eight, disagreements can linger and harden into factions. You don't want mere voting; you need unanimity."
}

How impressive is that? The results are not only very on-topic but are also sourced from three different essays. In these selected paragraphs, we see Paul Graham discussing strategies for picking good co-founders, as well as attributes of healthy co-founder relationships.

Per our discussion earlier about Matryoshka Representation Learning (MRL), we can also experiment with using only the first 512 dimensions of the embeddings. While this isn't strictly necessary for our current use case, since we can easily fit the entire embedded corpus in memory, it's a useful technique for managing the cost and performance of large-scale semantic search applications.

Python
"""
We can keep only the first 512 embedding dimensions by slicing the (N, 3072) matrix of corpus embeddings along the second dimension to get a new (N, 512) matrix. We can then do the same for the (1, 3072) query embedding.
"""
similarities = cosine_similarity(query_embedding[:, :512], corpus_embeddings[:, :512])[0]

After shortening the embeddings, we actually get the same results as before! When applying these approaches to your own semantic search problems, it's important to experiment with different models and embedding lengths to find the best trade-off between cost and performance.
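
To verify this yourself, one simple check (sketched below using the query_embedding and corpus_embeddings variables from above) is to compare the top-k indices produced by the full and truncated embeddings:

Python
# Compare the top-3 paragraph indices from full vs. truncated embeddings.
full_similarities = cosine_similarity(query_embedding, corpus_embeddings)[0]
short_similarities = cosine_similarity(query_embedding[:, :512], corpus_embeddings[:, :512])[0]

top_k = 3
full_top_k = full_similarities.argsort()[::-1][:top_k]
short_top_k = short_similarities.argsort()[::-1][:top_k]
print((full_top_k == short_top_k).all())  # True in our run -- same top-3 paragraphs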

However, one issue that remains is that the entire corpus lives in memory and is therefore not persisted anywhere. If we reset our notebook session or restarted the program, we'd need to re-download and re-embed the entire Paul Graham essay collection. What if we wanted to access our Paul Graham Essay Search Tool™ from a responsive web application? We'd need to persist the corpus somewhere remote and access it via an API. In the next lesson, we'll learn how to do exactly that using vector databases like ChromaDB. Stay tuned!