Scaling Up Search with Vector Databases

Meet ChromaDB

Before upgrading our Paul Graham Essay Search pipeline, let's meet the database that will do it: ChromaDB. ChromaDB is a fully open-source, developer-friendly, Python-native vector database whose approximate nearest neighbor (ANN) search is built on Hierarchical Navigable Small World (HNSW) graphs. It is designed to be simple to use and easy to integrate into existing Python-based semantic search pipelines. To install the package, run the following in your Python virtual environment:

Bash
pip install chromadb

The first step once you've installed ChromaDB is to instantiate a new client. In our case, we'll use chromadb.PersistentClient to persist our embeddings to disk, but you can equally use the in-memory client for testing and development purposes with chromadb.EphemeralClient. The client is the central reference to our database and the collections within it. In client-server mode, which we'll see later, the client is also responsible for managing the connection to the server.

Python
import chromadb

client = chromadb.PersistentClient(path="/path/to/save/to")

Key Features

Before going any further, let's quickly review why ChromaDB is a great choice. Here are some of the key features that it offers:

  • Fast CRUD operations: ChromaDB offers fast CRUD operations (create, read, update, delete), even as the database scales beyond millions of vectors.
  • Plug-and-play embedding function: ChromaDB allows you to use any embedding function you want, including popular models from OpenAI, Cohere, and Hugging Face. This means you can easily query the vector database without worrying about generating the embeddings yourself.
  • Customizable similarity metric: ChromaDB isn't limited to a single similarity metric. It supports cosine similarity, dot product (inner product), and squared Euclidean (L2) distance, giving users the freedom to choose the best fit for their application.
  • Metadata storage and filtering: ChromaDB allows you to store metadata alongside each vector and filter vectors based on this metadata in the query. We'll learn how to do this later.
  • HTTP Client: In addition to in-memory and persistent clients, ChromaDB also supports an HTTP client. This means you can deploy the vector database on a separate server from your application and query it over the internet!

Building the Database

Creating the Collection

A collection in ChromaDB is an isolated container for a set of vectors that we want to store and query together, similar to a table in a relational database. When defining a new collection, we can specify the embedding function (i.e. the underlying model that we'll use to embed documents and queries) and the similarity metric that the collection should use (e.g. cosine similarity, dot product, etc).

ChromaDB provides many adapters for popular embedding models, but in our case, we'll use the adapter for OpenAI and specify the text-embedding-3-large model that we learned about in the previous lesson. We'll also tell the collection to use cosine similarity as the similarity metric.

Python
from chromadb.utils import embedding_functions

# We'll use ChromaDB's embedding function adapter for OpenAI and specify the
# target model. We'll also pass our OpenAI API key here.
openai_embedding_function = embedding_functions.OpenAIEmbeddingFunction(
	api_key="YOUR_API_KEY",
	model_name="text-embedding-3-large"
)

# When we create the collection, we give it a name and specify both the
# embedding function and the similarity metric (cosine, in this case).
collection = client.create_collection(
	name="collection_name",
	embedding_function=openai_embedding_function,
	metadata={"hnsw:space": "cosine"}
)

# In the future, once we've created our collection, we can retrieve it from
# disk using `client.get_collection`. We pass the same embedding function
# again so that queries are embedded with the same model as the documents.
collection = client.get_collection(
	name="collection_name",
	embedding_function=openai_embedding_function
)
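
To make the choice of similarity metric more concrete, here is a rough pure-Python sketch of the three distance functions that, as far as we understand the ChromaDB documentation, correspond to the `hnsw:space` values "l2", "ip", and "cosine". The function names are our own, and this is an illustration of the formulas rather than ChromaDB's actual implementation.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance (the "l2" space)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a, b):
    """One minus the dot product (the "ip" space)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine similarity (the "cosine" space)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [1.0, 0.0]
b = [0.0, 1.0]  # orthogonal to a
c = [2.0, 0.0]  # same direction as a, but twice as long

print("cosine(a, b):", cosine_distance(a, b))
print("cosine(a, c):", cosine_distance(a, c))
print("l2(a, c):    ", squared_l2(a, c))
print("ip(a, c):    ", inner_product_distance(a, c))
```

Notice that cosine distance ignores vector length (a and c point the same way), while the other two metrics do not. In all three cases a smaller distance means a closer match.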

Adding Documents

The next step is to add documents to the collection with collection.add. If add receives a list of documents, it will automatically embed them using the collection's embedding function. We must also pass an equal-length ids list, which is used to uniquely identify each document in the collection. Finally, we can optionally pass a list of metadatas to store alongside each document, enabling metadata filtering in our queries. More on this later.

Python
collection.add(
	ids=["id1", "id2"],
	documents=["doc1", "doc2"],
	metadatas=[{"title": "A", "size": 141}, {"title": "B", "size": 51}]
)

Alternatively, if we've already embedded the documents ourselves, we can simply provide an equal-length list of embeddings alongside documents. For example, if we download an open-source, pre-embedded dataset, we'd want to add the embeddings directly to the collection rather than re-embedding them.

Python
collection.add(
	ids=["id1", "id2"],
	documents=["doc1", "doc2"],
	embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4]],
	metadatas=[{"title": "A", "size": 141}, {"title": "B", "size": 51}]
)
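
In a real pipeline we rarely type these lists by hand. The sketch below shows one possible way to build the equal-length ids, documents, and metadatas lists from a list of (title, text) pairs. The helper name `build_add_arguments`, the `essay-N` id scheme, and the placeholder essays are all hypothetical; only the three keyword names come from `collection.add`.

```python
def build_add_arguments(essays):
    """Turn (title, text) pairs into keyword arguments for collection.add."""
    ids, documents, metadatas = [], [], []
    for index, (title, text) in enumerate(essays):
        ids.append(f"essay-{index}")  # ids must be unique within the collection
        documents.append(text)
        metadatas.append({"title": title, "size": len(text)})
    return {"ids": ids, "documents": documents, "metadatas": metadatas}


# Placeholder data standing in for real essays.
essays = [
    ("A", "first placeholder essay text"),
    ("B", "second placeholder text"),
]
add_arguments = build_add_arguments(essays)

# The three lists must line up one-to-one.
assert (
    len(add_arguments["ids"])
    == len(add_arguments["documents"])
    == len(add_arguments["metadatas"])
)

# With a real collection, we would then call:
#     collection.add(**add_arguments)
print(add_arguments["ids"])
```

Building all three lists in a single loop makes it hard for them to drift out of sync.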

Other CRUD Operations

There are many other useful CRUD operations that ChromaDB supports for documents in a collection, including updating, upserting, deleting, and retrieving documents. All these CRUD operations are thoroughly documented in the ChromaDB usage guide.

Querying the Database

Now the fun part — querying the database using the underlying ANN algorithm! A ChromaDB collection can be queried in several ways using the collection.query method. The simplest way is to pass one of query_embeddings or query_texts.

query_embeddings will use pre-embedded queries, while query_texts will use text queries that are embedded using the collection's embedding function. We can also pass in n_results to specify the number of results to return (this is the parameter k in the approximate k-NN search).

Python
# query using query embeddings
results = collection.query(
	query_embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4]],
	n_results=2
)

# query using query texts that are embedded using the collection's embedding function
results = collection.query(
	query_texts=["query1", "query2"],
	n_results=2
)

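For each query, ChromaDB returns the n_results most similar documents in the collection. The results come back as a dictionary in the format shown below, containing the ids, distances (cosine distances in this case, so lower values mean a closer match), metadatas, embeddings, and documents for each matched document of each query. Note that embeddings are not returned by default; to get them, list "embeddings" in the query's include parameter.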

Python
{
	"ids": [
		["id1", "id3"],  # ids for query 1
		["id2", "id3"]   # ids for query 2
	],
	"distances": [
		[0.0946, 0.1234],  # cosine distances for query 1
		[0.1077, 0.1255]   # cosine distances for query 2
	],
	"metadatas": [
		[{"size": 141, "title": "A"}, {"size": 0, "title": "C"}],  # metadata for query 1
		[{"size": 51, "title": "B"}, {"size": 0, "title": "C"}]    # metadata for query 2
	],
	"embeddings": [
		[[1.1, 2.3, 3.2], [0.1, 0.2, 0.3]],  # embeddings for query 1
		[[4.5, 6.9, 4.4], [0.1, 0.2, 0.3]]   # embeddings for query 2
	],
	"documents": [
		["doc1", "doc3"],  # documents for query 1
		["doc2", "doc3"]   # documents for query 2
	]
}
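
To get a feel for this nested layout (one inner list per query), here is a small toy sketch that runs an exact, brute-force nearest-neighbor search in pure Python and returns its results in the same per-query shape. This is not how ChromaDB searches (it uses approximate HNSW search), and `brute_force_query` and the `stored` dictionary are hypothetical names introduced only for illustration.

```python
import math


def cosine_distance(a, b):
    """One minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_query(stored, query_embeddings, n_results):
    """Exact k-NN over `stored` ({id: embedding}), one result list per query."""
    results = {"ids": [], "distances": []}
    for query in query_embeddings:
        # Score every stored vector, then keep the n_results closest.
        ranked = sorted(
            (cosine_distance(query, embedding), doc_id)
            for doc_id, embedding in stored.items()
        )[:n_results]
        results["ids"].append([doc_id for _, doc_id in ranked])
        results["distances"].append([distance for distance, _ in ranked])
    return results


stored = {
    "id1": [1.1, 2.3, 3.2],
    "id2": [4.5, 6.9, 4.4],
    "id3": [0.1, 0.2, 0.3],
}
results = brute_force_query(
    stored,
    query_embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4]],
    n_results=2,
)

# results["ids"][i] and results["distances"][i] belong to the i-th query.
for query_index, (ids, distances) in enumerate(
    zip(results["ids"], results["distances"])
):
    for doc_id, distance in zip(ids, distances):
        print(f"query {query_index}: {doc_id} at distance {distance:.4f}")
```

The same double loop works for unpacking a real query result: the outer index selects the query, and the inner index walks its matches from closest to farthest.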

Filtering

Query filters are a powerful vector database feature that can drastically improve performance by reducing the number of documents that need to be searched. For example, if you have a collection of 10 million news articles but only need to search across articles from the last 24 hours, you can use filters to remove all articles older than 24 hours before performing the expensive ANN search. In ChromaDB, there are two ways to filter documents in a query: by metadata and by document contents.

The first method is to use the where parameter to filter the documents based on their metadata. Here's what this looks like:

Python
results = collection.query(
	query_texts="query1",
	n_results=2,
	where={
		"size": {
			"$gt": 100
		}
	}
)

In this query, we first filter out any document with a size metadata field less than or equal to 100 and only then actually search for the top 2 most similar documents. Note that if a document's metadata does not contain one of the filter keys, it will be excluded from the query results. The ChromaDB documentation lists all the supported filter operators.

The second method is to use the where_document parameter alongside either the $contains or $not_contains operator to filter the documents based on their contents. For example, the following filter will only query documents that contain the word "hello":

Python
results = collection.query(
	query_texts="query1",
	n_results=2,
	where_document={
		"$contains": "hello"
	}
)

Finally, we can also combine multiple filters using the $and and $or logical operators. For example, the following filter will only query documents with title = A or size > 100:

Python
results = collection.query(
	query_texts="query1",
	n_results=2,
	where={
		"$or": [
			{
				"title": {
					"$eq": "A"
				}
			},
			{
				"size": {
					"$gt": 100
				}
			}
		]
	}
)
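
To build some intuition for how these filters narrow the candidate set before the search runs, here is a toy evaluator written in pure Python. It is only a sketch under our own assumptions: the `matches` function and `OPERATORS` table are hypothetical, cover just a handful of operators, and are not ChromaDB's implementation.

```python
# Comparison operators, keyed by the names used in `where` filters.
OPERATORS = {
    "$eq": lambda value, target: value == target,
    "$gt": lambda value, target: value > target,
    "$gte": lambda value, target: value >= target,
    "$lt": lambda value, target: value < target,
    "$lte": lambda value, target: value <= target,
}


def matches(metadata, where):
    """Return True if `metadata` satisfies the `where` filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, clause) for clause in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, clause) for clause in condition):
                return False
        else:
            if key not in metadata:
                return False  # documents missing the key are excluded
            if not isinstance(condition, dict):
                condition = {"$eq": condition}  # shorthand: {"title": "A"}
            for operator, target in condition.items():
                if not OPERATORS[operator](metadata[key], target):
                    return False
    return True


metadatas = {
    "id1": {"title": "A", "size": 141},
    "id2": {"title": "B", "size": 51},
    "id3": {"title": "C"},  # no "size" field
}

where = {"$or": [{"title": {"$eq": "A"}}, {"size": {"$gt": 100}}]}

# Only these ids would go on to the (expensive) similarity search.
candidates = [
    doc_id for doc_id, metadata in metadatas.items() if matches(metadata, where)
]
print(candidates)
```

The point of the sketch is the ordering: the filter first shrinks the set of candidate documents, and only the survivors are ranked by similarity.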

Remote ChromaDB

One final feature to highlight is ChromaDB's client-server mode. In production applications, you'll often want to run the vector database on a separate server from your application and query it over HTTP. You can start an instance of ChromaDB in server mode using the following CLI command (installed with the Python package), where db_path is the path to the directory where you want to store the database:

Bash
chroma run --path /db_path

Alternatively, you can deploy the server using the official Docker image. Once the server is running, you can use chromadb.HttpClient to query it over HTTP. All the client-level and collection-level methods that we saw above are also available through the HTTP client.

Python
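import chromadb

# connect to the ChromaDB server and fetch the existing collection
client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_collection("collection_name")

# query the collection over HTTP (queries go through the collection, not the client)
results = collection.query(
	query_texts="query1",
	n_results=2
)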

If you're interested in deploying ChromaDB yourself in a production capacity, the ChromaDB documentation provides additional guidance on best practices for deploying in production, authenticating access, and adding observability.