Meet ChromaDB
Before upgrading our Paul Graham Essay Search pipeline, let's meet the database that will do it: ChromaDB. ChromaDB is an open-source, developer-friendly, Python-native vector database that uses a Hierarchical Navigable Small World (HNSW) index for approximate nearest neighbor (ANN) search. It is designed to be simple to use and easy to integrate into existing Python-based semantic search pipelines. To install the package, run the following in your Python virtual environment:
pip install chromadb
The first step once you've installed ChromaDB is to instantiate a new client. In our case, we'll use `chromadb.PersistentClient` to persist our embeddings to disk, but you can equally use the in-memory client for testing and development purposes with `chromadb.EphemeralClient`. The client is the central reference to our database and the collections within it. In client-server mode, which we'll see later, the client is also responsible for managing the connection to the server.
import chromadb
client = chromadb.PersistentClient(path="/path/to/save/to")
Key Features
Before building anything with it, let's quickly review why ChromaDB is a great choice. Here are some of the key features that it offers:
- Fast CRUD operations: ChromaDB offers fast CRUD operations (create, read, update, delete), even as the database scales beyond millions of vectors.
- Plug-and-play embedding function: ChromaDB allows you to use any embedding function you want, including popular models from OpenAI, Cohere, and Hugging Face. This means you can easily query the vector database without worrying about generating the embeddings yourself.
- Customizable similarity metric: ChromaDB isn't limited to a single similarity metric. It supports squared L2 distance (the default), inner product, and cosine similarity, giving users the freedom to choose the best fit for their application.
- Metadata storage and filtering: ChromaDB allows you to store metadata alongside each vector and filter vectors based on this metadata in the query. We'll learn how to do this later.
- HTTP Client: In addition to in-memory and persistent clients, ChromaDB also supports an HTTP client. This means you can deploy the vector database on a separate server from your application and query it over the internet!
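To make the similarity metric choice concrete, here is a minimal pure-Python sketch of the three distances as I understand ChromaDB to define them for the `l2`, `ip`, and `cosine` spaces. The function names are illustrative and not part of ChromaDB; in every case a smaller value means a closer match.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance (the "l2" space)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a, b):
    """One minus the dot product (the "ip" space)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine similarity (the "cosine" space)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Cosine distance ignores vector length, which is why it is a common choice for text embeddings.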
Building the Database
Creating the Collection
A collection in ChromaDB is an isolated container for a set of vectors that we want to store and query together, similar to a table in a relational database. When defining a new collection, we can specify the embedding function (i.e. the underlying model that we'll use to embed documents and queries) and the similarity metric that the collection should use (e.g. cosine similarity, dot product, etc).
ChromaDB provides many adapters for popular embedding models, but in our case, we'll use the adapter for OpenAI and specify the `text-embedding-3-large` model that we learned about in the previous lesson. We'll also tell the collection to use cosine similarity as the similarity metric.
from chromadb.utils import embedding_functions

# We'll use ChromaDB's embedding function adapter for OpenAI and specify the
# target model. We'll also pass our OpenAI API key here.
openai_embedding_function = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_API_KEY",
    model_name="text-embedding-3-large"
)

# When we create the collection, we'll give it a name and specify both the
# embedding function and similarity metric. In this case, we provide the
# embedding function above and specify cosine similarity as the metric.
collection = client.create_collection(
    name="collection_name",
    embedding_function=openai_embedding_function,
    metadata={"hnsw:space": "cosine"}
)

# In the future, once we've created our collection, we can simply retrieve it
# from disk using `client.get_collection`. We pass the same embedding function
# again so that text queries are embedded with the same model.
collection = client.get_collection(
    "collection_name",
    embedding_function=openai_embedding_function
)
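Because `create_collection` raises an error when the name already exists, a script that is run more than once may prefer `client.get_or_create_collection`. The helper below is a small sketch of that pattern; `open_collection` is a hypothetical name, and `client` is assumed to be any ChromaDB client such as the `PersistentClient` created earlier.

```python
def open_collection(client, name, embedding_function):
    """Return the named collection, creating it on first use.

    `client` is assumed to be a ChromaDB client; the cosine setting mirrors
    the collection configuration used in this lesson.
    """
    return client.get_or_create_collection(
        name=name,
        embedding_function=embedding_function,
        metadata={"hnsw:space": "cosine"},
    )
```

Calling it with the same name and embedding function on every run returns the existing collection instead of failing.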
Adding Documents
The next step is to add documents to the collection with `collection.add`. If `add` receives a list of `documents`, it will automatically embed them using the collection's embedding function. We must also pass an equal-length `ids` list, which is used to uniquely identify each document in the collection. Finally, we can optionally pass a list of `metadatas` to store alongside each document, enabling metadata filtering in our queries. More on this later.
collection.add(
    ids=["id1", "id2"],
    documents=["doc1", "doc2"],
    metadatas=[{"title": "A", "size": 141}, {"title": "B", "size": 51}]
)
Alternatively, if we've already embedded the documents ourselves, we can simply provide an equal-length list of `embeddings` alongside `documents`. For example, if we download an open-source, pre-embedded dataset, we'd want to add the embeddings directly to the collection rather than re-embedding them.
collection.add(
    ids=["id1", "id2"],
    documents=["doc1", "doc2"],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4]],
    metadatas=[{"title": "A", "size": 141}, {"title": "B", "size": 51}]
)
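For a larger corpus, such as a full set of essays, it can help to validate the lists up front and add documents in smaller batches, since a single very large `add` call may run into batch-size or embedding API limits. The following is a rough sketch; `add_in_batches` and its default `batch_size` are assumptions for illustration, not ChromaDB features.

```python
def add_in_batches(collection, ids, documents, metadatas=None, batch_size=100):
    """Add documents to a collection in fixed-size batches.

    Returns the number of `collection.add` calls made.
    """
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")
    if metadatas is not None and len(metadatas) != len(ids):
        raise ValueError("metadatas must have the same length as ids")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")

    batches = 0
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=ids[start:end],
            documents=documents[start:end],
            metadatas=metadatas[start:end] if metadatas is not None else None,
        )
        batches += 1
    return batches
```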
Other CRUD Operations
There are many other useful CRUD operations that ChromaDB supports for documents in a collection, including updating, upserting, deleting, and retrieving documents. All these CRUD operations are thoroughly documented in the ChromaDB usage guide.
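As a brief sketch of how these operations fit together, the helper below upserts the current version of each document, deletes ids that are no longer needed, and reports the collection size. It relies on the collection methods `upsert`, `delete`, and `count`; the function name `sync_collection` and its arguments are hypothetical.

```python
def sync_collection(collection, current_docs, stale_ids=()):
    """Upsert `current_docs` (a dict of id -> text) and delete `stale_ids`.

    Returns the number of documents in the collection afterwards.
    """
    ids = list(current_docs)
    if ids:
        # upsert updates existing ids and inserts new ones in a single call
        collection.upsert(
            ids=ids,
            documents=[current_docs[doc_id] for doc_id in ids],
        )
    stale = list(stale_ids)
    if stale:
        collection.delete(ids=stale)
    return collection.count()
```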
Querying the Database
Now the fun part: querying the database using the underlying ANN algorithm! A ChromaDB collection can be queried in several ways using the `collection.query` method. The simplest way is to pass one of `query_embeddings` or `query_texts`. `query_embeddings` will use pre-embedded queries, while `query_texts` will use text queries that are embedded using the collection's embedding function. We can also pass in `n_results` to specify the number of results to return (this is the parameter k in the approximate k-NN search).
# query using query embeddings
results = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4]],
    n_results=2
)

# query using query texts that are embedded using the collection's embedding function
results = collection.query(
    query_texts=["query1", "query2"],
    n_results=2
)
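For each query, ChromaDB will return the top `n_results` most similar documents in the collection. The results are returned in the dictionary format below, containing the `ids`, `distances` (cosine distances in this case, so lower values mean more similar documents), `metadatas`, `documents`, and `embeddings` for each matched document of each query. Note that embeddings are only returned when explicitly requested through the `include` parameter of `query`.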
{
    "ids": [
        ["id1", "id3"],  # ids for query 1
        ["id2", "id3"]   # ids for query 2
    ],
    "distances": [
        [0.0946, 0.1234],  # cosine distances for query 1
        [0.1077, 0.1255]   # cosine distances for query 2
    ],
    "metadatas": [
        [{"size": 141, "title": "A"}, {"size": 0, "title": "C"}],  # metadata for query 1
        [{"size": 51, "title": "B"}, {"size": 0, "title": "C"}],   # metadata for query 2
    ],
    "embeddings": [
        [[1.1, 2.3, 3.2], [0.1, 0.2, 0.3]],  # embeddings for query 1
        [[4.5, 6.9, 4.4], [0.1, 0.2, 0.3]]   # embeddings for query 2
    ],
    "documents": [
        ["doc1", "doc3"],  # documents for query 1
        ["doc2", "doc3"]   # documents for query 2
    ]
}
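Since every field is a list of lists (one inner list per query), it is often convenient to regroup the results into one record per hit. The sketch below does that for the dictionary shape shown above; `flatten_results` is a hypothetical helper, and the `similarity` field assumes the collection uses cosine distance, where similarity is one minus the distance.

```python
def flatten_results(results):
    """Turn a query result dict into a list of hit lists, one per query."""
    hits_per_query = []
    for ids, distances, documents, metadatas in zip(
        results["ids"],
        results["distances"],
        results["documents"],
        results["metadatas"],
    ):
        hits = [
            {
                "id": doc_id,
                "distance": distance,
                "similarity": 1.0 - distance,  # valid for cosine distance only
                "document": document,
                "metadata": metadata,
            }
            for doc_id, distance, document, metadata in zip(
                ids, distances, documents, metadatas
            )
        ]
        hits_per_query.append(hits)
    return hits_per_query
```

With this in place, `flatten_results(results)[0]` is the ranked list of hits for the first query.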
Filtering
Query filters are a powerful vector database feature that can drastically improve performance by reducing the number of documents that need to be searched. For example, if you have a collection of 10 million news articles but only need to search across articles from the last 24 hours, you can use filters to remove all articles older than 24 hours before performing the expensive ANN search. In ChromaDB, there are two ways to filter documents in a query: by metadata and by document contents.
The first method is to use the `where` parameter to filter the documents based on their metadata. Here's what this looks like:
results = collection.query(
    query_texts="query1",
    n_results=2,
    where={
        "size": {
            "$gt": 100
        }
    }
)
In this query, we first filter out any document with a `size` metadata field less than or equal to 100 and only then search for the top 2 most similar documents. Note that if a document's metadata does not contain one of the filter keys, it will be excluded from the query results. The supported comparison operators are `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, and `$lte`, along with `$in` and `$nin` for membership tests; the ChromaDB documentation has the full list.
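Returning to the news article example, one way to express "only the last 24 hours" is to store a numeric timestamp in each document's metadata and build the `where` filter from a cutoff. This is only a sketch: the `published_ts` field name and the `recent_filter` helper are assumptions, and it presumes timestamps were stored as Unix seconds when the documents were added.

```python
import time


def recent_filter(hours, now=None, field="published_ts"):
    """Build a `where` filter matching documents newer than `hours` ago."""
    now = time.time() if now is None else now
    cutoff = int(now - hours * 3600)
    return {field: {"$gte": cutoff}}
```

The returned dictionary can then be passed as the `where` argument of `collection.query`.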
The second method is to use the `where_document` parameter alongside either the `$contains` or `$not_contains` operator to filter the documents based on their contents. For example, the following filter will only query documents that contain the word "hello":
results = collection.query(
    query_texts="query1",
    n_results=2,
    where_document={
        "$contains": "hello"
    }
)
Finally, we can also combine multiple filters using the `$and` and `$or` logical operators. For example, the following filter will only query documents with `title = A` or `size > 100`:
results = collection.query(
    query_texts="query1",
    n_results=2,
    where={
        "$or": [
            {"title": {"$eq": "A"}},
            {"size": {"$gt": 100}}
        ]
    }
)
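To reason about which documents a metadata filter keeps, it can help to evaluate the filter by hand. The function below is a simplified pure-Python model of the semantics described in this section, covering a handful of comparison operators plus `$and` and `$or`. It is not ChromaDB's implementation, and `matches` is an illustrative name.

```python
import operator

# Subset of comparison operators modelled in this sketch
_COMPARATORS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata, where):
    """Return True if `metadata` satisfies the `where` filter."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    if "$or" in where:
        return any(matches(metadata, clause) for clause in where["$or"])
    for key, condition in where.items():
        if key not in metadata:
            return False  # documents missing a filter key are excluded
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # {"title": "A"} is shorthand for $eq
        for op, expected in condition.items():
            if not _COMPARATORS[op](metadata[key], expected):
                return False
    return True
```

Applied to the example metadata from earlier, the `$or` filter above keeps document A (title is A, size is 141) and drops document B (size is 51).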
Remote ChromaDB
One final feature to highlight is ChromaDB's client-server mode. In production applications, you'll often want to run the vector database on a separate server from your application and query it over HTTP. You can start an instance of ChromaDB in server mode using the following CLI command (installed with the Python package), where `/db_path` is the path to the directory where you want to store the database:
chroma run --path /db_path
Alternatively, you can deploy the server using the official Docker image. Once the server is running, you can use `chromadb.HttpClient` to query it over HTTP. All the client-level and collection-level methods that we saw above are also available through the HTTP client.
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
# if the collection uses a custom embedding function, pass it here as well
collection = client.get_collection("collection_name")

# query the collection over HTTP
results = collection.query(
    query_texts=["query1"],
    n_results=2
)
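When the server runs separately from the application, it may not be ready the moment the application starts. One possible approach is to retry the connection a few times and confirm it with the client's `heartbeat` method. The helper below is a sketch under that assumption; `connect_with_retry`, its parameters, and the retry policy are hypothetical, and `make_client` is any zero-argument callable that returns a client.

```python
import time


def connect_with_retry(make_client, attempts=5, delay_seconds=1.0):
    """Create a client and confirm the server responds, retrying on failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            client = make_client()
            client.heartbeat()  # raises if the server cannot be reached
            return client
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    raise ConnectionError(
        f"server not reachable after {attempts} attempts"
    ) from last_error
```

A call might look like `connect_with_retry(lambda: chromadb.HttpClient(host="localhost", port=8000))`.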
If you're interested in deploying ChromaDB yourself in a production capacity, the ChromaDB documentation provides additional guidance on best practices for deploying in production, authenticating access, and adding observability.