The recent advances demonstrated by Large Language Models, most notably ChatGPT, have shown how valuable these technologies can be in assisting with daily tasks. Whether it's crafting emails, optimizing marketing materials, creating exercise programs, cooking, or coding, LLMs can accelerate these tasks.
ChatGPT didn’t magically acquire these skills; it underwent an intricate training process that involved learning from vast datasets and fine-tuning its abilities.
Crucially, the recent breakthrough owes much of its success to the invaluable feedback provided by humans, which we'll explain later on.
Data Scientists and Machine Learning Engineers often emphasize the importance of high-quality data, which includes accurate, relevant, and well-structured information that fuels the model’s learning process.
While ChatGPT and Claude are impressive, they are generalists. The unique opportunity for anyone sitting on years of hard-earned knowledge is the ability to transform these generalist models into experts in a chosen domain.
In this article, I’ll break down the key LLM training methods to provide a reference for understanding how to train them. By doing so, I hope to establish a common language and understanding of how you can prepare your data for the training options at your disposal.
Large Language Models are continuously evolving. The figure below shows the exponential growth in submitted papers referencing the technology.
For now, at least, a canonical training pipeline (thank you, Sebastian Raschka) has crystallized. This pipeline has led to the groundbreaking performance of ChatGPT and Llama 2. Generally speaking, these models undergo a three-step process:
First, in a step called pretraining, or self-supervised training, the LLM absorbs an enormous corpus of unlabeled text. Next, the model is fine-tuned on labeled examples to improve the quality of its responses. Finally, an alignment step teaches the LLM to respond more reliably and in line with human preferences.
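To make these three steps concrete from a data-preparation point of view, here is a minimal sketch of the kind of data each stage consumes. The records and field names (prompt, response, chosen, rejected) are illustrative placeholders, not a specific dataset format.

```python
# Illustrative examples of the data each training stage consumes.
# The records below are made up for illustration, not from any real dataset.

# 1) Pretraining: raw, unlabeled text. No human annotation required.
pretraining_corpus = [
    "The mitochondria is the powerhouse of the cell.",
    "def add(a, b):\n    return a + b",
]

# 2) Supervised fine-tuning: labeled (prompt, response) pairs.
finetuning_examples = [
    {
        "prompt": "Summarize our refund policy in one sentence.",
        "response": "Customers may return items within 30 days for a full refund.",
    },
]

# 3) Alignment: human preference data, e.g. a chosen vs. rejected response.
preference_examples = [
    {
        "prompt": "Explain quantum computing to a 10-year-old.",
        "chosen": "Imagine a coin that can be heads and tails at the same time...",
        "rejected": "Quantum computing exploits unitary evolution of qubits...",
    },
]
```

Note how the human effort per example grows at each stage: raw text requires none, fine-tuning requires a written response, and alignment requires a comparative judgment.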
Initially, the LLM learns by absorbing billions, or even trillions, of tokens, one block at a time, and performing the task of next-word prediction. Because the training signal comes from the text itself, this technique is called self-supervised learning. It is remarkably powerful, allowing the model to grasp language as we know it.
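To see what next-word prediction looks like in practice, here is a minimal, self-contained sketch in PyTorch. It assumes a toy whitespace vocabulary and a tiny recurrent model rather than a subword tokenizer and a transformer, but the training signal, predicting each token from the tokens before it, is the same one used during pretraining.

```python
import torch
import torch.nn as nn

text = "the model learns to predict the next word from the words before it"
vocab = sorted(set(text.split()))
stoi = {w: i for i, w in enumerate(vocab)}
tokens = torch.tensor([stoi[w] for w in text.split()])

# Inputs are the sequence, targets are the same sequence shifted by one token:
# the "label" for every position is simply the word that comes next.
inputs, targets = tokens[:-1], tokens[1:]

class TinyLM(nn.Module):
    """A toy language model standing in for a real transformer."""
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)  # logits over the vocabulary at every position

model = TinyLM(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    logits = model(inputs.unsqueeze(0))         # shape: (1, seq_len, vocab)
    loss = loss_fn(logits.squeeze(0), targets)  # next-word prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

No human labels are needed here: the targets are just the input shifted by one position, which is exactly why this stage can scale to trillions of tokens of raw text.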