Scaling Up Search with Vector Databases

Introduction

So far, we've seen how to apply semantic search to small text corpora with up to 10,000 documents. At these scales, we're able to fit all of our document embeddings in memory and quickly perform an exhaustive search over all of them. But what happens if we want to scale up to millions or even billions of documents? Several issues arise that we haven't yet considered:

  1. Persistence: It is too costly to store our embeddings in memory and regenerate them every time we instantiate our semantic search pipeline. We need a way to persist our embeddings to disk so that they can be queried reliably around the clock.
  2. Storage: Closely related to the first point, we need a cost-effective storage solution as our corpus grows. In the previous lesson, we generated 3,072-dimensional embeddings. If we performed a semantic search over a corpus of 10 million documents, we would need to store 30.72 billion floating-point values, or approximately 122 GB of data (assuming standard 32-bit floats). On top of this, we need to retain the ability to quickly insert and delete embeddings as documents are added to or removed from the corpus.
  3. Performance: The most important and obvious consideration is performance. When we calculated the cosine similarity between our corpus of 7,533 embeddings and our query embedding, the total time elapsed came out to about 80 milliseconds. Imagine if we had to perform this calculation for 10 million documents: scaling linearly, a single query would take close to two minutes (the sketch after this list puts rough numbers on this).
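
To make these costs concrete, here is a back-of-envelope sketch in plain NumPy. The corpus size, dimensionality, and timing figures are the ones quoted above; the tiny random corpus at the end is only a stand-in to show what the exhaustive cosine-similarity search itself looks like, not part of our actual pipeline.

```python
import numpy as np

# Back-of-envelope costs for brute-force, in-memory search
# (figures taken from the scenario described above).
n_docs = 10_000_000        # hypothetical corpus size
dim = 3_072                # embedding dimensionality from the previous lesson
bytes_per_float = 4        # standard 32-bit floats

storage_gb = n_docs * dim * bytes_per_float / 1e9
print(f"Embedding matrix size: {storage_gb:.1f} GB")            # ~122.9 GB

# Linear extrapolation from the measured ~80 ms over 7,533 documents.
measured_docs, measured_seconds = 7_533, 0.080
est_seconds = measured_seconds * n_docs / measured_docs
print(f"Estimated exhaustive query time: {est_seconds:.0f} s")  # ~106 s

# The exhaustive search itself: cosine similarity of one query against every row.
corpus = np.random.rand(1_000, dim).astype(np.float32)   # tiny stand-in corpus
query = np.random.rand(dim).astype(np.float32)
scores = (corpus @ query) / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
top_5 = np.argsort(scores)[::-1][:5]                      # indices of the best matches
```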
In this lesson, we'll explore how to address these issues using vector databases, a specialized type of database that is optimized for storing and querying vector data. We'll also introduce ChromaDB, a simple-to-use, Python-based vector database, and use it to upgrade our Paul Graham Essay Search Tool™.

ChromaDB
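
As a preview, here is a minimal sketch of what a ChromaDB-backed pipeline might look like. The database path, collection name, documents, and embedding values are placeholders for illustration; we'll build the real version step by step later in the lesson.

```python
import chromadb

# Persist the database to disk so embeddings survive restarts (placeholder path).
client = chromadb.PersistentClient(path="./pg_essays_db")

# Collections are ChromaDB's unit of storage; the name here is illustrative.
collection = client.get_or_create_collection(name="paul_graham_essays")

# Insert documents along with precomputed embeddings (toy values shown).
collection.add(
    ids=["essay-001", "essay-002"],
    documents=["How to start a startup...", "The age of the essay..."],
    embeddings=[[0.1] * 3072, [0.2] * 3072],
)

# Query with an embedding of the search text; ChromaDB returns the nearest matches.
results = collection.query(
    query_embeddings=[[0.1] * 3072],
    n_results=2,
)
print(results["documents"])
```

Notice that the client writes to disk rather than holding everything in memory, and that inserts and queries go through the same collection object, which is exactly the persistence and storage story the list above asks for.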