Fine-Tuning OpenAI GPT with Custom Data

Preparing Our Customer Intent Dataset

The first stage of fine-tuning is to prepare our dataset. In many ways, this is the most critical part of fine-tuning, as we will see below. We will also learn about best practices and common pitfalls to avoid at each step in the process.

Data Sources

There are four common strategies for generating a fine-tuning dataset from scratch, and it's common to combine several of them:

  1. Find a public dataset: If you're lucky, someone has already collected and shared a dataset that's perfect for your task. For example, the IMDb dataset is a popular dataset for sentiment analysis tasks. Your best bet is to search for datasets on Kaggle, Google Dataset Search, or Hugging Face.
  2. Scrape the web: If you can't find a public dataset, you might be able to scrape the web for the data you need. For example, you could scrape product reviews from Amazon or posts from Reddit. This approach will typically require lots of data cleaning and deduplication.
  3. Generate synthetic data: A common approach in the LLM community is to use more powerful models like GPT-4 to generate synthetic data for fine-tuning smaller models. It's helpful to give the model examples of the task you want it to learn, and then ask it to generate more examples. This approach is especially useful when you have a small dataset and you want to increase its size.
  4. Collect data from humans: Since fine-tuned LLMs are usually served to humans, the most realistic data will always be generated by humans. Just make sure that the humans you're collecting data from are representative of the people who will be using your model. Also, if you're collecting data from humans in real-time, you can leverage online learning to keep your model up-to-date.
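Strategy 3 can be made concrete with a few-shot prompt. Below is a minimal, hypothetical sketch of a helper that assembles the messages we might send to a stronger model to request new labeled examples (the `build_generation_prompt` name and the seed example are illustrative, not part of any library or our actual pipeline):

```python
import json

def build_generation_prompt(seed_examples, n_new=5):
    """Assemble a few-shot chat prompt asking a stronger model to
    generate n_new additional labeled examples in the same format."""
    system = (
        "You generate realistic customer inquiries for a support intent "
        "classification dataset. Return a JSON list of objects with "
        "'inquiry', 'category', and 'sub_category' keys."
    )
    user = (
        f"Here are some examples:\n{json.dumps(seed_examples, indent=2)}\n\n"
        f"Generate {n_new} new, diverse examples in the same JSON format."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

seed = [
    {"inquiry": "Need update on order #12345",
     "category": "Order Management", "sub_category": "Status and Tracking"},
]
messages = build_generation_prompt(seed, n_new=3)
# These messages can then be sent to a model via the chat completions API,
# and the returned JSON parsed and appended to the dataset.
```

In practice you would also validate and deduplicate the model's output before trusting it, since synthetic data inherits any biases of the generating model.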

For our customer intent classification task, we've prepared a dataset of 500 high-quality examples that include realistic customer inquiries (e.g. mixed style and grammar) and an even distribution of categories and sub-categories. We've saved the dataset as a CSV with three columns: inquiry, category, and sub_category. Let's load the dataset into a pandas DataFrame to take a look at the first few rows:

Python
import pandas as pd

dataset_df = pd.read_csv('dataset.csv')
dataset_df.head(5)
Output
                                            inquiry               category         sub_category
0      How to know if I'm within the return period?  Returns and Exchanges        Return Policy
1  Can I retrieve a cancelled order if it was a...       Order Management         Cancellation
2                    when will my backorder arrive?    Product Information         Stock Status
3                       Need update on order #12345       Order Management  Status and Tracking
4                                Summer sale dates?    Product Information           Promotions
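Since we claimed an even distribution of categories, it's worth verifying that with `value_counts`. A minimal sketch, using a toy four-row DataFrame as a stand-in for the full 500-row dataset:

```python
import pandas as pd

# Toy stand-in for dataset_df; the real file has 500 rows
dataset_df = pd.DataFrame({
    "inquiry": ["a", "b", "c", "d"],
    "category": ["Returns and Exchanges", "Order Management",
                 "Product Information", "Order Management"],
    "sub_category": ["Return Policy", "Cancellation",
                     "Stock Status", "Status and Tracking"],
})

# normalize=True turns raw counts into proportions
counts = dataset_df["category"].value_counts(normalize=True)
print(counts)
```

If one category dominates, the fine-tuned model will tend to over-predict it, so it's worth rebalancing before training.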

Quality and Size

Repeat after me: "The quality of a fine-tuned model is only as good as the quality of its fine-tuning dataset." If you feed your model garbage, it will learn to generate garbage.

The quality of your dataset is, without a doubt, the most important factor in the success of your fine-tuning process. Regardless of whether your dataset is scraped from the web, written by humans, or generated by another model, it's critical that it is cleaned of noise (irrelevant tokens, typos, and off-topic examples), representative of the task at hand, and diverse enough to reflect the real-world inputs your model will see.

The second most important factor is the size of your dataset. Generally, the more data, the better, especially if your task is fairly complicated. As long as your data is high quality, more examples rarely hurt, so it's always better to have more data than less. At least 100 examples is a good start, but if your task has lots of variability (e.g. bucketing text into one of many possible categories) or is otherwise hard for a model to generalize to, you'll want to aim for thousands of examples.
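As a concrete starting point for that cleanup, pandas makes basic denoising cheap: trimming whitespace, dropping empty rows, and removing exact duplicates. A minimal sketch, using a toy DataFrame with the same three columns as our CSV:

```python
import pandas as pd

# Toy DataFrame with a duplicate, stray whitespace, and a missing inquiry
df = pd.DataFrame({
    "inquiry": ["Need update on order #12345", "Need update on order #12345",
                "  Summer sale dates?  ", None],
    "category": ["Order Management", "Order Management",
                 "Product Information", "Returns and Exchanges"],
    "sub_category": ["Status and Tracking", "Status and Tracking",
                     "Promotions", "Return Policy"],
})

df["inquiry"] = df["inquiry"].str.strip()    # trim stray whitespace
df = df.dropna(subset=["inquiry"])           # drop rows with no inquiry
df = df.drop_duplicates(subset=["inquiry"])  # remove exact-match duplicates

print(len(df))  # → 2
```

Exact-match deduplication is only a first pass; near-duplicates (paraphrases of the same inquiry) often need fuzzy matching or embedding-based similarity to catch.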

Input Format

As we saw in the introductory lesson, an individual example in our supervised fine-tuning dataset will consist of an input and an output. The input is the prompt you want to feed the model and the output is the response you'd like the model to learn.

In this case, OpenAI's fine-tuning platform expects each input-output pair to conform to a specific JSON format as shown below, consisting of a system prompt, user input, and assistant output. While there is no formal name for this format, it is sometimes referred to as the chat format and each section is referred to as a message:

JSON
{
	"messages": [
		{
			"role": "system",
			"content": "You are a helpful assistant." // system prompt
		},
		{
			"role": "user",
			"content": "Who won the world series in 2020?" // input
		},
		{
			"role": "assistant",
			"content": "The Dodgers won in 2020." // output
		}
	]
}

This format is used widely in the LLM community and is the standard format for OpenAI's fine-tuning and chat completions APIs. As seen here, we first provide a system prompt (also known as the instruction) to help guide the model, followed by the user's input, and finally the model's response (attributed to the "assistant" role here).
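Before fine-tuning, it pays to validate that every example actually follows this shape, since a single malformed example can cause the platform to reject the whole file. A minimal sketch (the `validate_example` helper is our own, not part of the OpenAI SDK):

```python
def validate_example(example):
    """Check that an example has system, user, and assistant messages,
    in that order, each with non-empty string content."""
    messages = example.get("messages", [])
    roles = [m.get("role") for m in messages]
    if roles != ["system", "user", "assistant"]:
        return False
    return all(isinstance(m.get("content"), str) and m["content"]
               for m in messages)

good = {"messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Dodgers won in 2020."},
]}
bad = {"messages": [{"role": "user", "content": "missing the rest"}]}

print(validate_example(good), validate_example(bad))  # → True False
```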

When we pass our formatted examples to OpenAI's fine-tuning platform, the final assistant message is the output that the model will learn to generate given the system and user messages. For our fine-tuning use case, the system prompt will be the same for each example: "Classify the user message into a category and sub-category and output the result as JSON." The user's input will simply be the customer inquiry and the model's response will be the stringified JSON output of the category and sub-category.

🔎 By prepending the example with a system prompt, we're actually using a fine-tuning technique called instruction tuning! Typically, we'd fine-tune on many different tasks, each with its own prepended instruction, which allows the LLM to develop robust, multi-task generalization skills. In our case though, we're only fine-tuning on our customer intent classification task.

Let's now adapt our dataset to this chat format. For each row, we'll first use Python's json.dumps to serialize the category and sub-category into the JSON string the model will learn to output, then assemble the full example from the three messages. Let's print out the first example to see what it looks like:

Python
import json

instruction = "Classify the user message into a category and sub-category and output the result as JSON."

processed_data = []
for index, row in dataset_df.iterrows():
    classification_json_string = json.dumps({
        "category": row["category"],
        "sub_category": row["sub_category"],
    })

    processed_data.append({
        "messages": [
            {
                "role": "system",
                "content": instruction
            },
            {
                "role": "user",
                "content": row["inquiry"]
            },
            {
                "role": "assistant",
                "content": classification_json_string
            }
        ]
    })

processed_data[0]
Output
{
    'messages': [
        {'role': 'system', 'content': 'Classify the user message into a category and sub-category and output the result as JSON.'},
        {'role': 'user', 'content': "How to know if I'm within the return period?"},
        {'role': 'assistant', 'content': '{"category": "Returns and Exchanges", "sub_category": "Return Policy"}'}
    ]
}
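OpenAI's fine-tuning platform expects these examples in a JSONL file (one JSON object per line), so before uploading we'd serialize `processed_data` accordingly. A minimal sketch, using a two-example stand-in for the full list:

```python
import json

# Stand-in for the full processed_data list built above
instruction = ("Classify the user message into a category and sub-category "
               "and output the result as JSON.")
processed_data = [
    {"messages": [
        {"role": "system", "content": instruction},
        {"role": "user", "content": "Summer sale dates?"},
        {"role": "assistant",
         "content": '{"category": "Product Information", "sub_category": "Promotions"}'},
    ]}
    for _ in range(2)
]

# One JSON object per line, as the fine-tuning API expects
with open("train.jsonl", "w") as f:
    for example in processed_data:
        f.write(json.dumps(example) + "\n")

# Each line round-trips back to the original dict
with open("train.jsonl") as f:
    lines = f.readlines()
print(len(lines))  # → 2
```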

Picking a Validation Split

The final step in preparing our dataset is to divide it into training and validation splits (a.k.a. sets). The training split is used to fine-tune the model, while the validation split is used to evaluate the model's performance throughout fine-tuning.

Why do we need this? This separation is necessary because it helps identify whether the model is overfitting—memorizing the training data rather than learning to generalize from it. Without a validation split, we would lack a reliable method to gauge if the model can perform well on new, unseen examples, potentially compromising its effectiveness in real-world applications.

A common practice is to use 80% of the data for training and 20% for validation, which we can do using the train_test_split function from the scikit-learn library. By passing the stratify argument, the function performs stratified sampling: it samples examples randomly while ensuring that the distribution of categories is preserved in both the training and validation splits. Let's split our dataset and print out the number of examples in each set:

Python
from sklearn.model_selection import train_test_split

train_data, validation_data = train_test_split(
    processed_data,
    test_size=0.20, # 20% of the data will be used for validation
    stratify=dataset_df['category'], # the column to distribute evenly (i.e. stratify)
    random_state=42 # a random seed for reproducibility
)

len(train_data), len(validation_data)
Output
(400, 100)
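To confirm that stratification worked, we can compare the category proportions across the two splits. A minimal self-contained sketch, using a toy two-category dataset in place of `processed_data`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset: 10 examples per category, mirroring the stratify call above
labels = ["Order Management"] * 10 + ["Product Information"] * 10
data = list(range(20))

train, val, y_train, y_val = train_test_split(
    data, labels,
    test_size=0.20,
    stratify=labels,
    random_state=42,
)

# Proportions should match in both splits (0.5 / 0.5 here)
train_dist = pd.Series(y_train).value_counts(normalize=True)
val_dist = pd.Series(y_val).value_counts(normalize=True)
print(train_dist.to_dict(), val_dist.to_dict())
```

If the proportions diverge noticeably on your real data, a category may be too rare to stratify cleanly, which is itself a signal to collect more examples for it.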