A Deep-Dive Into Fine-Tuning

Fine-Tuning Variants

The supervised fine-tuning (SFT) process that we've discussed is just one of many related fine-tuning variants. If you have unique requirements, such as teaching multiple tasks at once or aligning the model with human preferences, you might want to frame your fine-tuning process in one of the following ways:

Instruction Fine-Tuning

Instruction fine-tuning, also called instruction tuning, is an improvement to standard fine-tuning that prepends an instruction to each input-output example. By doing so, the model learns to generalize across multiple tasks, each of which can be triggered at inference time by prompting with a similar instruction. Models trained this way show greatly improved zero-shot instruction-following capabilities.

Google's FLAN popularized these methods by fine-tuning pre-trained LLMs on massive instruction datasets with up to 15M examples. The paper explains that "[FLAN] involves fine-tuning a model not to solve a specific task, but to make it more amenable to solving NLP tasks in general." Here's a brief illustration of how this is staged compared to standard fine-tuning:

[Figure: Instruction tuning compared to standard fine-tuning]

As we'll see in the subsequent lessons, instruction tuning is pervasive in the fine-tuning ecosystem and a necessary step in building production-grade chat LLMs. To perform instruction fine-tuning, we add a system prompt or instruction before the input-output pair in each example. Here's what a single instruction-based example might look like in practice:

JSON
{
	"system": "Answer the following question using Shakespearean English.",
	"input": "What is the capital of France?",
	"output": "The fair capital of France doth be Paris."
}
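
To see how such a record could be turned into training text, here is a minimal sketch in Python; the template, section markers, and field names are illustrative assumptions rather than a standard, since production chat models each define their own chat template.

Python
# Minimal sketch: flatten an instruction-style example into one training string.
# The "### ..." section markers are an illustrative convention, not a standard.

def format_example(example: dict) -> str:
    """Concatenate the system instruction, input, and output into one string."""
    return (
        f"### Instruction:\n{example['system']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )

example = {
    "system": "Answer the following question using Shakespearean English.",
    "input": "What is the capital of France?",
    "output": "The fair capital of France doth be Paris.",
}

print(format_example(example))

During supervised fine-tuning, the loss is typically computed only on the response tokens, so the instruction and input act as conditioning context rather than prediction targets.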

By fine-tuning on a dataset with millions of examples spanning a wide variety of instructions, the model learns to generalize to many tasks and instructions and shows an overall improvement in instruction-following behavior. Another emergent behavior is that if you fine-tune the model on many examples that share the same instruction, you can trigger the trained behavior at inference time by prompting with that instruction (you're effectively activating the pathways learned during training).

Task-Specific Fine-Tuning

Task-specific fine-tuning is the process of fine-tuning an LLM on a dataset that is specifically tailored for a single task, such as sentiment analysis for Tweets. Task-specific fine-tuning can be particularly effective if used on a model that has been previously fine-tuned for another related task — this is called transfer learning. One of the biggest downsides of this approach is the risk of catastrophic forgetting, in which the model abruptly forgets previously learned information.

Expanding beyond single-task focus, multi-task fine-tuning introduces the model to a dataset that encompasses multiple tasks at once, such as entity recognition, part-of-speech tagging, and relation extraction. By training across diverse tasks simultaneously, the model develops a more generalized capability, reducing the risk of catastrophic forgetting.
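
As a rough sketch of how such a multi-task mixture might be assembled, the snippet below interleaves a few tiny task-specific datasets into a single training stream using the Hugging Face datasets library; the toy examples, column names, and sampling probabilities are placeholder assumptions.

Python
# Sketch: interleave several task-specific datasets into one multi-task mixture.
# The toy data and the prompt/completion schema are illustrative placeholders.
from datasets import Dataset, interleave_datasets

ner_ds = Dataset.from_dict({
    "prompt": ["Tag the entities: Paris is lovely in spring."],
    "completion": ["Paris -> LOC"],
})
pos_ds = Dataset.from_dict({
    "prompt": ["Tag the parts of speech: He runs quickly."],
    "completion": ["He/PRON runs/VERB quickly/ADV"],
})
relation_ds = Dataset.from_dict({
    "prompt": ["Extract the relation: Marie Curie won the Nobel Prize."],
    "completion": ["(Marie Curie, won, Nobel Prize)"],
})

# Sample from each task with the given probabilities to form one stream.
multi_task_ds = interleave_datasets(
    [ner_ds, pos_ds, relation_ds],
    probabilities=[0.4, 0.3, 0.3],
    seed=42,
)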

The last task-specific variant is sequential fine-tuning, which involves training the model sequentially on multiple task-specific datasets, one after the other. For example, you might train the model on respiratory disease diagnosis, then on heart disease diagnosis, and finally on cancer diagnosis. Sequential fine-tuning emphasizes the composability of fine-tuning and transfer learning techniques. As such, it is an especially powerful tool for open-source developers who want to stack fine-tunes on top of one another.
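
To make the idea of stacking fine-tunes concrete, here is a minimal sketch of two chained supervised fine-tuning stages using the Hugging Face transformers Trainer; the base model name, dataset variables, and hyperparameters are placeholder assumptions, and the datasets are assumed to already be tokenized with input_ids and labels columns.

Python
# Sketch of sequential fine-tuning: each stage resumes from the weights
# produced by the previous stage. respiratory_ds and cardiac_ds are assumed
# to be tokenized datasets; the base model and hyperparameters are placeholders.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

def fine_tune_stage(model, dataset, output_dir: str):
    """Run one supervised fine-tuning pass and return the updated model."""
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=4,
    )
    trainer = Trainer(model=model, args=args, train_dataset=dataset)
    trainer.train()
    trainer.save_model(output_dir)
    return trainer.model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

# Stage 1: respiratory disease diagnosis; Stage 2: heart disease diagnosis.
model = fine_tune_stage(model, respiratory_ds, "ckpt/respiratory")
model = fine_tune_stage(model, cardiac_ds, "ckpt/cardiac")

Because each stage starts from the previous checkpoint, earlier capabilities can degrade; mixing a small amount of earlier-stage data into later stages is a common mitigation for catastrophic forgetting.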

Preference-Based Fine-Tuning

Preference-based fine-tuning is a very popular area of research in the LLM fine-tuning space that aims to align LLMs with human preferences. These techniques, which build on earlier research into reinforcement learning from human preferences, were brought to prominence by OpenAI in 2022 and have since been adopted by other proprietary organizations like Anthropic and popularized in the open-source community by Meta AI's Llama 2.

To perform preference-based fine-tuning, we first collect human preference datasets in which each example is a triplet of the form (prompt, human-preferred response, human-rejected response). We then apply specialized fine-tuning algorithms, often based on reinforcement learning, to gently steer the model toward generating outputs that align with human preferences. These methods play a necessary role in ensuring that LLMs are not only accurate and helpful but also safe and aligned with human values.
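
As a concrete, purely illustrative picture of what a single preference example might look like, here is a small sketch; field names such as "prompt", "chosen", and "rejected" follow a common convention but vary between libraries.

Python
# Illustrative preference triplet: a prompt plus a human-preferred ("chosen")
# and a human-rejected ("rejected") response. Field names follow a common
# convention but are not a fixed standard.
preference_example = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules, and shorter blue "
              "wavelengths scatter the most, so the sky looks blue.",
    "rejected": "The sky is blue because it reflects the ocean.",
}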

The most popular preference-based fine-tuning techniques include Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). Here is a brief illustration of an end-to-end model training pipeline, from pre-training to supervised fine-tuning to RLHF:

[Figure: End-to-end training pipeline, from pre-training to supervised fine-tuning to RLHF]
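
For a sense of how a preference pair is turned into a training signal, here is a minimal PyTorch sketch of the DPO objective; the log-probability inputs are assumed to be precomputed sums over response tokens (from the policy model being trained and a frozen reference model), and beta is an illustrative value.

Python
# Minimal sketch of the per-pair Direct Preference Optimization (DPO) loss.
# The log-probability inputs are assumed to be precomputed; in practice they
# come from the policy model being trained and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logp: torch.Tensor,    # log p_policy(chosen | prompt)
    policy_rejected_logp: torch.Tensor,  # log p_policy(rejected | prompt)
    ref_chosen_logp: torch.Tensor,       # log p_ref(chosen | prompt)
    ref_rejected_logp: torch.Tensor,     # log p_ref(rejected | prompt)
    beta: float = 0.1,                   # illustrative KL-penalty strength
) -> torch.Tensor:
    """Mean DPO loss over a batch of preference pairs."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss is minimized when the chosen response gains log-probability
    # (relative to the reference model) faster than the rejected response.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(
    torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
    torch.tensor([-12.5, -9.6]), torch.tensor([-13.5, -9.2]),
)
print(loss.item())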