Meet OpenAI's Embedding Models
Model History
Let's start by meeting the OpenAI model family. OpenAI originally introduced their Embedding API in January 2022, along with the release of 4 embedding models based on GPT-3. These original models were designed to offer a spectrum of performance and cost, ranging from the fast, cheap, and low-accuracy Ada-1 to the slow, expensive, and accurate DaVinci-1. Between these two extremes were the middle-of-the-road Babbage-1 and Curie-1 models.
In December 2022, OpenAI released a single second-generation model, Ada-2, which improved on Ada-1's performance. But the real kicker came a year later in January 2024, with the release of their third-generation models, called the Text Embedding v3 family. The release featured 2 models: `text-embedding-3-small`, a small and highly efficient model, and `text-embedding-3-large`, a larger and more powerful next-generation model.
Features
Architecture
The first-generation models were based on GPT-3, the well-known transformer-based LLM that OpenAI introduced in 2020. However, in classic OpenAI fashion, the architectural details of all subsequent embedding models have remained undisclosed.
The one interesting tidbit that OpenAI has revealed about the latest Text Embedding v3 models is that they leverage a training technique called Matryoshka Representation Learning (MRL). MRL trains an embedding model in such a way that its output embeddings can be shortened (by truncating dimensions from the end) without losing much of the model's representation power. As a result, developers can control the trade-off between embedding size and representation power, which is critical for managing the cost and performance of large-scale semantic search applications. More on this later!
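As a minimal sketch of the idea (using a random vector as a stand-in for a real API embedding, and assuming embeddings are unit-normalized, as OpenAI's are), shortening an MRL embedding amounts to truncating it and re-normalizing:

```python
import math
import random

def shorten_embedding(embedding: list[float], dims: int) -> list[float]:
    """Keep the first `dims` dimensions of an MRL-trained embedding,
    then re-normalize to unit length so cosine similarity stays meaningful."""
    shortened = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in shortened))
    return [x / norm for x in shortened]

# Stand-in for a real 1536-dimensional embedding from the API
random.seed(0)
full = [random.gauss(0, 1) for _ in range(1536)]

short = shorten_embedding(full, 512)
print(len(short))  # 512
```

Note that the v3 embedding endpoints also expose a `dimensions` parameter that performs this truncation server-side, so you can request shortened embeddings directly from the API.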
Dimensions
As highlighted above, the Text Embedding v3 models are designed to be highly flexible in terms of the size of the embeddings they generate. By default, `text-embedding-3-small` generates 1536-dimensional embeddings, while `text-embedding-3-large` generates 3072-dimensional embeddings. Here is a comparison of the three models' performance on the MTEB benchmark, which measures the performance of text embedding models on diverse embedding tasks:
The main thing to notice here is that even if you only keep the first 512 dimensions of the `text-embedding-3-small` embeddings, you will outperform Ada-2 on MTEB! Why are smaller embeddings important? Because:
- Smaller embeddings require less compute to generate, making them both faster and cheaper to produce with the API.
- Smaller embeddings can be compared with less compute (recall from A Primer on Text Embeddings), making them faster to use in semantic search applications.
- Smaller embeddings require less storage, making semantic search pipelines cheaper to scale and maintain.
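To make the storage point concrete, here is a back-of-the-envelope sketch (assuming embeddings are stored as 4-byte float32 values, a common choice; the 10-million-document corpus is an illustrative assumption):

```python
def storage_gb(num_docs: int, dims: int, bytes_per_value: int = 4) -> float:
    """Storage needed for num_docs float32 embedding vectors, in gigabytes."""
    return num_docs * dims * bytes_per_value / 1e9

# Storing 10 million document embeddings at different sizes:
for dims in (3072, 1536, 512):
    print(f"{dims} dims: {storage_gb(10_000_000, dims):.1f} GB")
```

Storage scales linearly with dimension count: 10 million full-size `text-embedding-3-large` vectors take roughly 123 GB, while the same corpus at 512 dimensions fits in about 20 GB.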
Tokenizer
OpenAI's embedding models use a custom tokenizer called `cl100k_base`, featuring an impressive vocabulary of ~100k tokens. We can test out the tokenizer programmatically using the `tiktoken` Python package. Alternatively, you can head to OpenAI's tokenizer demo to test out the tokenizer behavior in your browser.
```python
import tiktoken

# Load the tokenizer used by OpenAI's embedding models
tokenizer = tiktoken.get_encoding("cl100k_base")

text = "Hello, world!"
tokenized_text = tokenizer.encode(text)
print(tokenized_text)
# [9906, 11, 1917, 0]
```
Input Size
The latest OpenAI embedding models feature a maximum input size of 8191 tokens, which is a 4x improvement over the first-generation models at 2046 tokens. Since a single token roughly translates to three-quarters of a word, this means that you can embed documents with up to 6,100 words.
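As a rough sketch of how you might pre-check whether a document fits in a single request, here is a helper based on the words-per-token rule of thumb above (the `estimate_tokens` function and its 0.75 ratio are illustrative assumptions; use `tiktoken` for an exact count):

```python
import math

MAX_INPUT_TOKENS = 8191  # limit for OpenAI's latest embedding models

def estimate_tokens(text: str) -> int:
    """Rough token estimate via the ~0.75 words-per-token rule of thumb.
    (Use tiktoken for an exact count before relying on this.)"""
    return math.ceil(len(text.split()) / 0.75)

def fits_in_one_request(text: str) -> bool:
    """True if the text's estimated token count is within the model limit."""
    return estimate_tokens(text) <= MAX_INPUT_TOKENS

print(fits_in_one_request("Hello, world!"))  # True
```

Documents that exceed the limit would need to be chunked into smaller pieces before embedding.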
Pricing
Now for the most exciting part. The new Text Embedding v3 models are not only more powerful but are also much cheaper to use than Ada-2 (and first-generation models). Here is a comparison of the pricing in terms of the number of pages of text (assuming each page consists of 800 tokens) that can be embedded for a single dollar of cost:
That's a lot of embedding power for a single dollar! In terms of exact pricing, `text-embedding-3-small` usage costs $0.02 per 1M tokens and `text-embedding-3-large` usage costs $0.13 per 1M tokens. Current pricing for the Text Embedding v3 models can always be found on the OpenAI pricing page.
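Using the prices quoted above and the 800-tokens-per-page assumption, the pages-per-dollar comparison can be sketched as a quick calculation (prices are as published at the time of writing; check the pricing page for current values):

```python
PRICE_PER_1M_TOKENS = {  # dollars, as quoted above
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}
TOKENS_PER_PAGE = 800  # the page-size assumption used above

def embedding_cost(num_tokens: int, model: str) -> float:
    """Dollar cost of embedding num_tokens with the given model."""
    return num_tokens / 1_000_000 * PRICE_PER_1M_TOKENS[model]

for model, price in PRICE_PER_1M_TOKENS.items():
    pages_per_dollar = 1_000_000 / price / TOKENS_PER_PAGE
    print(f"{model}: {pages_per_dollar:,.0f} pages per dollar")
# text-embedding-3-small: 62,500 pages per dollar
# text-embedding-3-large: 9,615 pages per dollar
```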