Upgrading Essay Search
Migrating to ChromaDB
Alright, now that we've covered the boring stuff, let's upgrade our Paul Graham Essay Search Tool™ to use ChromaDB. We can start by instantiating a new ChromaDB client and creating a new collection, following the same steps that we learned above.
import chromadb
from chromadb.utils import embedding_functions
"""
One of the issues with our original implementation was that we were not persisting our embeddings anywhere. Let's fix that using the persistent client.
"""
client = chromadb.PersistentClient('db')
"""
We'll use the OpenAI adapter for our embedding function and specify the same OpenAI `text-embedding-3-large` model that we used previously.
"""
ada_2_embedding_function = embedding_functions.OpenAIEmbeddingFunction(
    api_key=API_KEY,
    model_name="text-embedding-3-large"
)
"""
The `client.get_or_create_collection` method creates the collection if it doesn't exist yet and retrieves it if it does. This makes collection creation idempotent, so we can run this code multiple times without creating duplicate collections.
"""
collection = client.get_or_create_collection(
    name="paul_graham_essay_search",
    embedding_function=ada_2_embedding_function,
    metadata={"hnsw:space": "cosine"}
)
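To make the `"hnsw:space": "cosine"` setting a bit more concrete, here is a minimal pure-Python sketch of cosine distance, which Chroma documents as one minus the cosine similarity. The `cosine_distance` helper is a hypothetical name used only for illustration; it is not part of ChromaDB, which computes distances internally.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Vectors pointing the same way are (almost exactly) at distance 0,
# orthogonal vectors at distance 1, and opposite vectors at distance 2.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

Smaller distances mean more similar paragraphs, which is why the closest match comes back first when we query.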
Now, we need to add the documents. Previously, in our corpus building and embedding steps, we created a `corpus` list containing a dictionary of data for each document, a `corpus_texts` list containing the text for each document, and a `corpus_embeddings` NumPy matrix containing the embedding for each document. If you haven't seen the previous implementation, go familiarize yourself with it first!
To add the documents to ChromaDB, we'll need to create an `ids` list containing a unique ID for each document, a `documents` list containing the text for each document, and a `metadatas` list containing the metadata for each document. We'll also pass `embeddings` in order to reuse the embeddings that we've already generated.
"""
We'll create a list of unique IDs for each document by combining the essay title and paragraph index. We'll also make it all lowercase and replace spaces with underscores.
"""
ids = [
    document['essay_title'].lower().replace(' ', '_')
    + '_' + str(document['paragraph_index'])
    for document in corpus
]
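Since each ID is meant to identify exactly one document, it can be worth checking that the scheme really produces unique values (it does so long as no two essays share a title). Here is a minimal sketch of such a check. The `make_id` and `find_duplicate_ids` helpers and the three sample records are made up for illustration; they are not part of the original pipeline.

```python
from collections import Counter


def make_id(document):
    """Build an ID the same way as above: lowercased title plus paragraph index."""
    title = document["essay_title"].lower().replace(" ", "_")
    return f"{title}_{document['paragraph_index']}"


def find_duplicate_ids(ids):
    """Return a sorted list of every ID that appears more than once."""
    return sorted(id_ for id_, count in Counter(ids).items() if count > 1)


# Hypothetical sample records shaped like the entries in `corpus`.
sample_corpus = [
    {"essay_title": "How to Start a Startup", "paragraph_index": 0},
    {"essay_title": "How to Start a Startup", "paragraph_index": 1},
    {"essay_title": "Hackers and Painters", "paragraph_index": 0},
]

sample_ids = [make_id(document) for document in sample_corpus]
duplicates = find_duplicate_ids(sample_ids)  # empty when every ID is unique
```

Running `find_duplicate_ids(ids)` on the real list before inserting anything is a cheap way to catch collisions early.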
"""
This one is easy. Our `corpus_texts` list already contains the text for each document.
"""
documents = corpus_texts
"""
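We need `embeddings` to be a list of embedding lists (i.e. a list of lists of floats). Thankfully, NumPy provides a convenient `tolist` method that will convert our (7533, 1536) NumPy matrix directly to this format. Note that the second dimension has to match the output size of the collection's embedding function, since queries are embedded with that function.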
"""
embeddings = corpus_embeddings.tolist()
"""
Finally, we'll create a list of dictionaries containing the metadata for each document. To do so, we'll simply extract the `essay_title` and `paragraph_index` from each document in the corpus.
"""
metadatas = [
    {
        "essay_title": document["essay_title"],
        "paragraph_index": document["paragraph_index"],
    }
    for document in corpus
]
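The four lists are parallel: position `i` in each one describes the same document. Before inserting anything, a quick sanity check can confirm that they line up. This is a minimal sketch; `check_parallel_lists` is a hypothetical helper and the tiny inputs are invented for the demo.

```python
def check_parallel_lists(ids, documents, embeddings, metadatas):
    """Verify the lists have equal lengths and the embeddings share one width.

    Returns (number_of_documents, embedding_width) or raises ValueError.
    """
    lengths = {len(ids), len(documents), len(embeddings), len(metadatas)}
    if len(lengths) != 1:
        raise ValueError(f"List lengths differ: {sorted(lengths)}")

    widths = {len(vector) for vector in embeddings}
    if len(widths) > 1:
        raise ValueError(f"Embedding widths differ: {sorted(widths)}")

    return len(ids), (widths.pop() if widths else 0)


# Invented two-document example with 3-dimensional embeddings.
count, width = check_parallel_lists(
    ids=["essay_a_0", "essay_a_1"],
    documents=["first paragraph", "second paragraph"],
    embeddings=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    metadatas=[{"paragraph_index": 0}, {"paragraph_index": 1}],
)
```

Calling the same helper on the real `ids`, `documents`, `embeddings`, and `metadatas` should report one entry per paragraph in the corpus.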
The last step is to add everything to the collection! Using ChromaDB's `collection.add` method, we can add every document to the collection in one go. This may take a minute since ChromaDB needs to persist all the metadata and insert each embedding vector into the underlying ANN index.
collection.add(
    ids=ids,
    documents=documents,
    embeddings=embeddings,
    metadatas=metadatas
)
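Depending on the ChromaDB version, a single `add` call may be limited in how many records it accepts, so a large corpus sometimes has to be inserted in chunks. The sketch below shows one way to do that. The `add_in_batches` helper and the `batch_size` of 1000 are assumptions chosen for illustration, and `RecordingCollection` is a stand-in object so the snippet runs without a database; a real collection would be passed in its place.

```python
def add_in_batches(collection, ids, documents, embeddings, metadatas, batch_size=1000):
    """Call collection.add repeatedly on consecutive slices of the parallel lists."""
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=ids[start:end],
            documents=documents[start:end],
            embeddings=embeddings[start:end],
            metadatas=metadatas[start:end],
        )


class RecordingCollection:
    """Stand-in for a real collection: it only records the size of each batch."""

    def __init__(self):
        self.batch_sizes = []

    def add(self, ids, documents, embeddings, metadatas):
        self.batch_sizes.append(len(ids))


# Invented data: 2500 placeholder records.
demo_count = 2500
demo_collection = RecordingCollection()
add_in_batches(
    demo_collection,
    ids=[str(i) for i in range(demo_count)],
    documents=["placeholder text"] * demo_count,
    embeddings=[[0.0, 1.0]] * demo_count,
    metadatas=[{"paragraph_index": i} for i in range(demo_count)],
)
batch_sizes = demo_collection.batch_sizes
```

With the real data, the call would look like `add_in_batches(collection, ids, documents, embeddings, metadatas)`.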
Querying ChromaDB
Using the built-in embedding function, we can query the collection without worrying about embedding the queries ourselves. We'll also pass `n_results=1` to only return the most similar paragraph and `include=["documents"]` to only include the document text in the results (this is helpful when using the HTTP client to avoid sending unnecessary data over the network).
results = collection.query(
    query_texts="how to pick a good co-founder",
    n_results=1,
    include=["documents"]
)
print(results["documents"][0][0])
Number two, make the most of the great advantage of school: the wealth of co-founders. Look at the people around you and ask yourself which you'd like to work with. When you apply that test, you may find you get surprising results. You may find you'd prefer the quiet guy you've mostly ignored to someone who seems impressive but has an attitude to match. I'm not suggesting you suck up to people you don't really like because you think one day they'll be successful. Exactly the opposite, in fact: you should only start a startup with someone you like, because a startup will put your friendship through a stress test. I'm just saying you should think about who you really admire and hang out with them, instead of whoever circumstances throw you together with.
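The double index in `results["documents"][0][0]` reflects how query results are nested: the outer list has one entry per query text, and each inner list holds the results for that query, best match first. The sketch below walks a hand-written dictionary that mimics this shape. The dictionary contents are invented, not real output, and `documents_per_query` is a hypothetical helper.

```python
def documents_per_query(results):
    """Return one list of document texts per query, preserving result order."""
    return [list(hits) for hits in results["documents"]]


# Hand-written stand-in for a result with two queries and n_results=2.
mock_results = {
    "ids": [["essay_a_3", "essay_b_0"], ["essay_c_7", "essay_a_1"]],
    "documents": [
        ["best match for query 1", "runner-up for query 1"],
        ["best match for query 2", "runner-up for query 2"],
    ],
}

per_query = documents_per_query(mock_results)
best_matches = [hits[0] for hits in per_query]  # top result for each query
```

With a single query and `n_results=1`, both lists have length one, which is why `[0][0]` picks out the only paragraph returned.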
Now let's try using a filter! For example, we can retrieve only opening paragraphs that discuss the topic of "US immigration policy". This filter will leverage the `paragraph_index` metadata field that we parsed in the previous lesson:
results = collection.query(
    query_texts="US immigration policy",
    n_results=1,
    where={
        "paragraph_index": {
            "$eq": 0
        }
    },
    include=["documents"]
)
print(results["documents"][0][0])
American technology companies want the government to make immigration easier because they say they can't find enough programmers in the US. Anti-immigration people say that instead of letting foreigners take these jobs, we should train more Americans to be programmers. Who's right?
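Filters can also combine several conditions. The sketch below builds a `where` dictionary using Chroma's `$and` and `$lte` operators to restrict a search to the first few paragraphs of one essay. The `early_paragraphs_filter` helper, its `max_index` default, and the example title are assumptions made up for illustration; only the shape of the dictionary matters.

```python
def early_paragraphs_filter(essay_title, max_index=2):
    """Build a filter matching paragraphs 0..max_index of a single essay."""
    return {
        "$and": [
            {"essay_title": {"$eq": essay_title}},
            {"paragraph_index": {"$lte": max_index}},
        ]
    }


# Hypothetical example: the first three paragraphs of one essay.
where_filter = early_paragraphs_filter("Hackers and Painters")
```

The resulting dictionary could then be passed as `where=where_filter` to `collection.query`, in place of the single-condition filter used above.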
There we go! We've successfully migrated our semantic search pipeline to use ChromaDB. Querying the corpus just got a whole lot simpler and we can safely scale up to many millions of documents.