Meet OpenAI's Embedding Models
Model History
Let's start by meeting the OpenAI model family. OpenAI originally introduced their Embedding API in January 2022, along with the release of 4 embedding models based on GPT-3. These original models were designed to offer a spectrum of performance and cost, ranging from the fast, cheap, and low-accuracy Ada-1 to the slow, expensive, and accurate DaVinci-1. Between these two extremes were the middle-of-the-road Babbage-1 and Curie-1 models.
In December 2022, OpenAI released a single second-generation model, Ada-2, which improved on Ada-1's performance. But the real kicker came a year later in January 2024, with the release of their third-generation models, called the Text Embedding v3 family. The release featured 2 models: `text-embedding-3-small`, a small and highly efficient model, and `text-embedding-3-large`, a larger and more powerful next-generation model.
Features
Architecture
The first-generation models were based on GPT-3, the well-known transformer-based LLM that OpenAI introduced in 2020. However, in classic OpenAI fashion, the architectural details of all subsequent embedding models have remained undisclosed.
The one interesting tidbit that OpenAI has revealed about the latest Text Embedding v3 models is that they leverage a training technique called Matryoshka Representation Learning (MRL). MRL trains an embedding model in such a way that its output embeddings can be shortened (by truncating dimensions from the end) without losing much of the model's representation power. As a result, developers can control the trade-off between embedding size and representation power, which is critical for managing the cost and performance of large-scale semantic search applications. More on this later!
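As a minimal sketch of the idea (using a random vector as a stand-in for a real API embedding, and assuming embeddings are unit-normalized, as OpenAI's are), shortening an MRL embedding amounts to truncating it and re-normalizing:

```python
import math
import random

def shorten_embedding(embedding: list[float], dims: int) -> list[float]:
    """Keep the first `dims` dimensions of an MRL-trained embedding,
    then re-normalize to unit length so cosine similarity stays meaningful."""
    shortened = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in shortened))
    return [x / norm for x in shortened]

# Stand-in for a real 1536-dimensional embedding from the API
random.seed(0)
full = [random.gauss(0, 1) for _ in range(1536)]

short = shorten_embedding(full, 512)
print(len(short))  # 512
```

Note that the v3 embedding endpoints also expose a `dimensions` parameter that performs this truncation server-side, so you can request shortened embeddings directly from the API.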
Dimensions
As highlighted above, the Text Embedding v3 models are designed to be highly flexible in terms of the size of the embeddings they generate. By default, `text-embedding-3-small` generates 1536-dimensional embeddings, while `text-embedding-3-large` generates 3072-dimensional embeddings. Here is a comparison of the three models' performance on the MTEB benchmark, which measures the performance of text embedding models on diverse embedding tasks:
The main thing to notice here is that even if you only keep the first 512 dimensions of the `text-embedding-3-small` embeddings, you will outperform Ada-2 on MTEB! Why are smaller embeddings important? Because:
- Smaller embeddings require less compute to generate, making them both faster and cheaper to produce with the API.
- Smaller embeddings can be compared with less compute (recall from A Primer on Text Embeddings), making them faster to use in semantic search applications.
- Smaller embeddings require less storage, making semantic search pipelines cheaper to scale and maintain.
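To make the storage point concrete, here is a back-of-the-envelope sketch (assuming embeddings are stored as 4-byte float32 values, a common choice; the 10-million-document corpus is an illustrative assumption):

```python
def storage_gb(num_docs: int, dims: int, bytes_per_value: int = 4) -> float:
    """Storage needed for num_docs float32 embedding vectors, in gigabytes."""
    return num_docs * dims * bytes_per_value / 1e9

# Storing 10 million document embeddings at different sizes:
for dims in (3072, 1536, 512):
    print(f"{dims} dims: {storage_gb(10_000_000, dims):.1f} GB")
```

Storage scales linearly with dimension count: 10 million full-size `text-embedding-3-large` vectors take roughly 123 GB, while the same corpus at 512 dimensions fits in about 20 GB.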
Tokenizer
OpenAI's embedding models use a custom tokenizer called `cl100k_base`, featuring an impressive vocabulary of ~100k tokens. We can test out the tokenizer programmatically using the `tiktoken` Python package. Alternatively, you can head to OpenAI's tokenizer demo to test out the tokenizer behavior in your browser.
```python
import tiktoken

# Load the tokenizer used by OpenAI's embedding models
tokenizer = tiktoken.get_encoding("cl100k_base")

text = "Hello, world!"
tokenized_text = tokenizer.encode(text)
print(tokenized_text)
# [9906, 11, 1917, 0]
```
Input Size
The latest OpenAI embedding models feature a maximum input size of 8191 tokens, which is a 4x improvement over the first-generation models at 2046 tokens. Since a single token roughly translates to three-quarters of a word, this means that you can embed documents with up to 6,100 words.
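As a rough sketch of how you might pre-check whether a document fits in a single request, here is a helper based on the words-per-token rule of thumb above (the `estimate_tokens` function and its 0.75 ratio are illustrative assumptions; use `tiktoken` for an exact count):

```python
import math

MAX_INPUT_TOKENS = 8191  # limit for OpenAI's latest embedding models

def estimate_tokens(text: str) -> int:
    """Rough token estimate via the ~0.75 words-per-token rule of thumb.
    (Use tiktoken for an exact count before relying on this.)"""
    return math.ceil(len(text.split()) / 0.75)

def fits_in_one_request(text: str) -> bool:
    """True if the text's estimated token count is within the model limit."""
    return estimate_tokens(text) <= MAX_INPUT_TOKENS

print(fits_in_one_request("Hello, world!"))  # True
```

Documents that exceed the limit would need to be chunked into smaller pieces before embedding.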
Pricing
Now for the most exciting part. The new Text Embedding v3 models are not only more powerful but are also much cheaper to use than Ada-2 (and first-generation models). Here is a comparison of the pricing in terms of the number of pages of text (assuming each page consists of 800 tokens) that can be embedded for a single dollar of cost:
That's a lot of embedding power for a single dollar! In terms of exact pricing, `text-embedding-3-small` usage costs $0.02 per 1M tokens and `text-embedding-3-large` usage costs $0.13 per 1M tokens. Current pricing for the Text Embedding v3 models can always be found on the OpenAI pricing page.
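Using the prices quoted above and the 800-tokens-per-page assumption, the pages-per-dollar comparison can be sketched as a quick calculation (prices are as published at the time of writing; check the pricing page for current values):

```python
PRICE_PER_1M_TOKENS = {  # dollars, as quoted above
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}
TOKENS_PER_PAGE = 800  # the page-size assumption used above

def embedding_cost(num_tokens: int, model: str) -> float:
    """Dollar cost of embedding num_tokens with the given model."""
    return num_tokens / 1_000_000 * PRICE_PER_1M_TOKENS[model]

for model, price in PRICE_PER_1M_TOKENS.items():
    pages_per_dollar = 1_000_000 / price / TOKENS_PER_PAGE
    print(f"{model}: {pages_per_dollar:,.0f} pages per dollar")
# text-embedding-3-small: 62,500 pages per dollar
# text-embedding-3-large: 9,615 pages per dollar
```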