A Primer on Embeddings and Semantic Search

Similarity Scoring

As we've seen, embedding vectors that are close to one another within the embedding space often represent words that are semantically similar. The "closeness" between embeddings is therefore a fundamental aspect of text retrieval and NLP, prompting the need for a way to quantify this similarity.

This is where similarity scoring comes into play. It provides a quantitative way to measure how much two words (or phrases) are semantically related. As you can imagine, this is a crucial component of text retrieval systems, which rely on similarity scoring to match queries with the most relevant (or "similar") documents.

Dot Product

A simple yet effective way to measure similarity between two vectors is by using the dot product. The dot product of two vectors is the summation of the products of their elements. In other words, simply multiply the two vectors together and then add up their entries. Here is the formula:

\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i

For example, if you had two vectors \vec{a} = [5, 3, 9] and \vec{b} = [5, -10, 0], their dot product would be:

\vec{a} \cdot \vec{b} = 5 \times 5 + 3 \times (-10) + 9 \times 0 = 25 - 30 + 0 = -5

If two corresponding elements are both very positive, they contribute a large positive value to the dot product. If they are both very negative, they also contribute a large positive value. On the other hand, if two corresponding elements have opposite signs or are close to zero, they contribute a small or negative value. Thus, vectors whose corresponding elements agree produce a higher score, making the dot product a natural measure of similarity between two vectors.

Let's put this into practice by comparing the embedding vector for the word "king" with the embedding vectors for "queen", "man", "fruit", and "transport".

Python
"""
(1) We'll begin by defining a function that calculates the dot product between two vectors. We'll use NumPy's `dot` function, which takes two NumPy arrays as input and returns their dot product.
"""
import numpy as np

def calculate_dot_product(vec1: np.ndarray, vec2: np.ndarray) -> float:
    return np.dot(vec1, vec2)

"""
(BONUS) Just for fun, we can include a naive way to calculate the dot product as well. We'll just iterate through the vectors and multiply the corresponding elements together, adding the products to a running sum.
"""
def calculate_dot_product_naive(vec1: np.ndarray, vec2: np.ndarray) -> float:
    dot_product = 0
    for i in range(len(vec1)):
        dot_product += vec1[i] * vec2[i]
    return dot_product

"""
(2) Next, let's generate embedding vectors for the words we want to compare. We'll use the same GloVe model we used in the previous code demo.
"""
king_embedding = model['king']
queen_embedding = model['queen']
man_embedding = model['man']
fruit_embedding = model['fruit']
transport_embedding = model['transport']

"""
(3) Finally, let's compare the embedding for "king" with the embeddings for "queen", "man", "fruit", and "transport".
"""
print('king vs queen:', calculate_dot_product(king_embedding, queen_embedding))
print('king vs man:', calculate_dot_product(king_embedding, man_embedding))
print('king vs fruit:', calculate_dot_product(king_embedding, fruit_embedding))
print('king vs transport:', calculate_dot_product(king_embedding, transport_embedding))
Output
king vs queen: 17.370674
king vs man: 19.135818
king vs fruit: 14.976494
king vs transport: 6.333047

The output shows that our embedding for "king" is most similar to the embeddings for "man" and "queen" and least similar to the embedding for "transport". This makes sense! While some might expect "king" and "queen" to be far apart because they refer to opposite genders, the embedding model has learned that they are closely related in meaning, as both are royal titles. On the other hand, "king" and "transport" are very different in meaning, so the embedding model has learned to place them far apart.

Cosine Similarity

The dot product is a straightforward way to measure similarity between word embeddings. However, it is sensitive to the magnitudes of the vectors being compared, which can distort results in the high-dimensional spaces used for text embeddings. This is where cosine similarity becomes a valuable alternative.

The key issue with using the dot product for similarity lies in the role of vector magnitude (also called vector length). Imagine two vectors in a 50-dimensional space: one is relatively short and only loosely aligned with a much longer vector. Their dot product could still be fairly large, driven primarily by the length of the longer vector rather than by how closely the two vectors point in the same direction. This is problematic for text embeddings, where we want to focus on semantic similarity (the direction of the vectors) rather than the frequency or intensity of word usage (the magnitude of the vectors).

🔎 In this context, the terms "magnitude", "length" and "norm" all refer to the same idea. You can intuit what "length" means in the context of vectors by once again imagining them as points in 2D or 3D space, where vector length is equivalent to the distance from the origin to the point. A "long" vector is therefore just one that's further from the origin. The concept of length is harder to visualize in higher dimensions, but the idea is the same.
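
To make this concrete, here's a minimal toy sketch (using NumPy and made-up numbers rather than real embeddings). The vectors b and 10 * b point in exactly the same direction, yet scaling up the length inflates the dot product tenfold:

Python
"""
A toy illustration of how magnitude alone can inflate the dot product.
"""
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 1.0, 1.0])

print(np.dot(a, b))       # dot product with the shorter vector
print(np.dot(a, 10 * b))  # same direction, 10x the length -> 10x the dot product
Output
6.0
60.0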

To counter the influence of vector length, we can normalize our vectors. Normalization here means adjusting the vectors so that they all have a unit length (i.e. a length of 1). This process strips away the impact of magnitude, allowing us to focus solely on the direction in which the vectors point.
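
In code, normalization is just a division by the vector's own norm. Here's a quick sketch with an illustrative vector (not a real embedding), using NumPy's linalg.norm:

Python
"""
Normalizing a vector: dividing it by its norm yields a unit-length vector pointing in the same direction.
"""
import numpy as np

v = np.array([5.0, 3.0, 9.0])
unit_v = v / np.linalg.norm(v)

print(np.linalg.norm(unit_v))  # 1.0 (up to floating-point error)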

In mathematical terms, we achieve this by dividing the dot product of the two vectors by the product of the magnitudes of the two vectors. This approach is the same as first normalizing each vector to a unit length and then taking their dot product. Doing so, we end up with a similarity measure that ranges from -1 (indicating opposite meanings) to 1 (identical meanings), with 0 denoting no similarity. The formula for cosine similarity is:

\text{cosine similarity} = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}

Here, \vec{a} \cdot \vec{b} is the dot product of the two vectors, and \|\vec{a}\| and \|\vec{b}\| are the magnitudes (also called vector norms) of the two vectors. The norms are calculated as the square root of the sum of the squares of the vector's elements:

\|\vec{a}\| = \sqrt{\sum_{i=1}^{n} a_i^2}

Interestingly, this normalized dot product is equivalent to calculating the cosine of the angle between the two vectors, leading us to cosine similarity. This makes sense: if two vectors are pointing in the same direction, the cosine of their angle will be 1, indicating maximum similarity. If they are pointing in opposite directions, their cosine will be -1, indicating maximum dissimilarity. If they are perpendicular, their cosine will be 0, indicating no similarity.
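
Before turning to a library implementation, here is a minimal sketch of cosine similarity written directly from the formula above. It assumes NumPy is available and reuses the GloVe embeddings (king_embedding, queen_embedding, transport_embedding) from the earlier demo; the second function just confirms that dividing by the norms is equivalent to normalizing first and then taking the dot product.

Python
"""
A minimal cosine similarity function based on the formula above, plus an equivalent version that normalizes the vectors first. Exact scores depend on the embedding model, so no output is shown here.
"""
import numpy as np

def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def cosine_similarity_via_unit_vectors(vec1: np.ndarray, vec2: np.ndarray) -> float:
    # Normalize each vector to unit length, then take the dot product.
    unit1 = vec1 / np.linalg.norm(vec1)
    unit2 = vec2 / np.linalg.norm(vec2)
    return np.dot(unit1, unit2)

print('king vs queen:', cosine_similarity(king_embedding, queen_embedding))
print('king vs transport:', cosine_similarity(king_embedding, transport_embedding))

Because the magnitudes are divided out, both versions return the same score, and the result is always between -1 and 1 no matter how long the original embedding vectors are.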

To illustrate the practical application of cosine similarity in text embeddings, let's use Gensim's most_similar feature. Under the hood, this function efficiently computes the cosine similarity between the embedding for a target word and the embeddings for every other word in the model's vocabulary, then returns the words with the top n highest similarity scores. In effect, the most_similar routine is performing a semantic search across a corpus where each "document" is a single word in the model's vocabulary.

We'll use the same GloVe model as before to perform this semantic search against the model's vocabulary of 1.2 million words using the query "technology". Let's see this in action:

Python
"""
We'll call the `most_similar` function on our GloVe model, passing in the target word "technology" and the number of similar words we want to retrieve (n=5 in this case). The function will return a list of tuples, where each tuple contains a similar word and its similarity score. We'll then iterate through the list and print out each word and its similarity score.
"""
similar_words = model.most_similar('technology', topn=5)
print("Words most similar to 'technology':")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.3f}")
Output
tech: 0.904
systems: 0.895
development: 0.891
enterprise: 0.884
computing: 0.869

The output contains the 5 most similar words to "technology" and their respective scores. We can see that the words are all related to technology, with the top 5 being "tech", "systems", "development", "enterprise", and "computing". The similarity scores here are all very high, approaching 1.0, which is the maximum possible cosine similarity.

It's important to note that the resulting words are not just synonyms or directly related terms. Instead, they reflect the contextual usage and associations found in the training data. For instance, words that frequently appear in Tweets about technology, or in the same contexts as "technology" itself, are likely to have higher similarity scores.

This example demonstrates how cosine similarity, applied through tools like Gensim, allows us to explore and understand the semantic relationships in text data. It's a crucial technique in text retrieval, enabling us to move beyond simple keyword matching to understanding deeper, context-driven relationships between words.
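
To connect this back to the "under the hood" description of most_similar, here's a rough manual version of the same idea: score the query embedding against a handful of candidate word embeddings and rank the results. The candidate list below is just an illustrative subset chosen for the sketch, and the code reuses the model from the demo above along with the cosine_similarity helper sketched earlier in this section; Gensim does the real thing over the full vocabulary far more efficiently with vectorized matrix operations.

Python
"""
A simplified, manual version of what `most_similar` does: compute the cosine similarity between the query embedding and each candidate embedding, then sort the candidates by score.
"""
candidates = ['computer', 'software', 'internet', 'banana', 'king']

query_embedding = model['technology']
scores = {word: cosine_similarity(query_embedding, model[word]) for word in candidates}

# Rank the candidates from most to least similar to the query.
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {score:.3f}")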