A Primer on Embeddings and Semantic Search

Technical Definition

Now that we've explored the fundamentals of text embeddings, let's define semantic search explicitly. As we've seen, text retrieval uses a query to find the most relevant documents in a corpus. Applying text embeddings to this task, an approach called semantic search, provides a powerful framework for matching queries with documents based on their semantic meaning. Other benefits include:

  • Improved Relevance: By understanding the intent and context, semantic search can provide more relevant results, often uncovering information that keyword-based searches might miss.
  • Handling Ambiguity & Misspelling: Semantic search is more adept at handling ambiguous queries, offering results that are contextually more appropriate. It is also robust to minor spelling mistakes and can match acronyms to their expansions.
  • Multilingual Support: Because semantic search operates on the underlying meaning of text rather than exact words, a multilingual embedding model can match queries and documents across languages, including ones with relatively little training data.

Here is a quick step-by-step walk-through of how semantic search works:

  1. Embed the Corpus: First, we embed each document in the corpus using a pre-trained embedding model. This generates a vector representation for each document.
  2. Embed the Query: Next, we embed the query using the same embedding model. This generates a vector representation for the query.
  3. Calculate Similarity: Finally, we calculate the similarity between the query vector and each document vector using a similarity measure such as cosine similarity. This gives us a similarity score for each document, which we can use to rank the documents in order of relevance.

Figure: The three-step semantic search pipeline.
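
To make these three steps concrete, here is a minimal sketch in Python. The embed() function is a hypothetical stand-in for a real embedding model (it returns random vectors so the example runs end to end); everything else is plain NumPy, with the "vector database" being nothing more than a stacked matrix.

```python
import numpy as np

# Hypothetical stand-in for a real embedding model: any function that
# maps a string to a fixed-size vector will do. Random vectors are used
# here only so the sketch runs end to end.
rng = np.random.default_rng(0)

def embed(text: str) -> np.ndarray:
    return rng.standard_normal(64)

corpus = [
    "How to bake sourdough bread at home",
    "A beginner's guide to training neural networks",
    "Tips for maintaining a road bike",
]

# Step 1: embed each document and stack the vectors into one matrix.
doc_matrix = np.stack([embed(doc) for doc in corpus])

# Step 2: embed the query with the same model.
query_vec = embed("getting started with deep learning")

# Step 3: cosine similarity is the dot product of L2-normalized vectors.
doc_norms = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
query_norm = query_vec / np.linalg.norm(query_vec)
scores = doc_norms @ query_norm

# Rank documents from most to least similar.
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:+.3f}  {corpus[idx]}")
```

With a real embedding model, the ranking would reflect actual semantic relevance; the mechanics of the three steps stay exactly the same.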

Areas of Improvement

In practice, nearly every semantic search system will follow the above three steps. That said, two components of this pipeline make all the difference: the model and the vector database.

The model is the pre-trained embedding model that we use to embed the corpus and the query. More sophisticated models typically project documents into richer (often higher-dimensional) embedding spaces, capturing their semantic meaning more accurately. The vector database stores the vector representations of the documents in the corpus. These databases are purpose-built to efficiently store vectors and perform similarity scoring at scale.
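
As a taste of what "purpose-built" means here, the sketch below uses FAISS, an open-source similarity-search library, to index a matrix of embeddings and answer top-k queries. The random vectors are placeholders for real embeddings, and later lessons cover dedicated vector databases in more depth.

```python
import faiss
import numpy as np

# Placeholder embeddings: 1,000 documents, 64 dimensions, float32 as FAISS expects.
rng = np.random.default_rng(0)
doc_matrix = rng.standard_normal((1000, 64)).astype(np.float32)
doc_matrix /= np.linalg.norm(doc_matrix, axis=1, keepdims=True)

# An inner-product index over L2-normalized vectors scores by cosine similarity.
index = faiss.IndexFlatIP(doc_matrix.shape[1])
index.add(doc_matrix)

# Retrieve the 5 most similar documents to a (placeholder) query vector.
query = rng.standard_normal((1, 64)).astype(np.float32)
query /= np.linalg.norm(query)
scores, doc_ids = index.search(query, 5)
print(doc_ids[0], scores[0])
```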

Below, we will start with the basics: a simple model called doc2vec and a vector database that is simply an in-memory NumPy matrix holding our stacked document embeddings. Later lessons in this course will explore the most advanced and powerful models and vector databases available today.
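
As a preview of that starting point, here is a rough sketch using Gensim's Doc2Vec implementation. The tiny corpus and hyperparameters are placeholders chosen so the example runs; a real setup would train on far more text.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "how to bake sourdough bread at home",
    "a beginner's guide to training neural networks",
    "tips for maintaining a road bike",
]

# Doc2Vec learns one vector per tagged document during training.
tagged = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]
model = Doc2Vec(tagged, vector_size=64, min_count=1, epochs=40)

# Our "vector database": the learned document vectors stacked into a NumPy matrix.
doc_matrix = np.stack([model.dv[i] for i in range(len(corpus))])

# New queries are embedded by inference rather than lookup.
query_vec = model.infer_vector("getting started with deep learning".split())

# Cosine similarity against every document, then rank.
scores = (doc_matrix @ query_vec) / (
    np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
)
print(corpus[int(np.argmax(scores))])
```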