What follows is an annotated guide to Andrej Karpathy's video on building GPT from scratch, from his seminal Zero to Hero series.
The guide includes a few alterations to spice things up, but the core ideas and code persist. In future posts, I'll expand on or dive deeper into some of the core concepts covered.
Let's dive in.
Let's start by setting up the libraries we need to gather data and build our model. We'll be using Hugging Face datasets to access data for training our model.
I've chosen the UltraTextbooks dataset - a reservoir of long-form text samples - available on the Hugging Face Hub as Locutusque/UltraTextbooks.
Note that you'll need an access token to use any gated datasets - so I've left the notebook login cell there for you. Otherwise, feel free to pick and choose any dataset you like.
!pip install -qqq datasets
from huggingface_hub import notebook_login
notebook_login()
For our little experiment, we're going to start with 50 texts. You can experiment with different data sizes depending on how much time and compute you have to play around with.
from datasets import load_dataset
# make sure you have run notebook_login() for gated datasets
dataset = load_dataset(
"Locutusque/UltraTextbooks",
split="train",
streaming=True
)
# sample 50 texts from the streaming dataset
data_iter = iter(dataset)
sample_data = [next(data_iter) for _ in range(50)]
print(f"Number of samples: {len(sample_data)}")
print(sample_data[0])
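If you want to experiment with different sample sizes, one simple option is to slice the streaming iterator with the standard library. A minimal sketch, where num_samples is a name introduced here purely for illustration:
from itertools import islice
# grab however many samples your time and compute budget allows
num_samples = 200  # hypothetical value - tune to your budget
sample_data = list(islice(iter(dataset), num_samples))
print(f"Number of samples: {len(sample_data)}")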
After we gather the dataset, we're ready to build the training corpus. Generally speaking, GPT models learn from gigantic text corpora during the training phase called pretraining. The pretraining step produces what is often referred to as a Foundation Model or Base Model.
Individual samples aren't much use to us, especially in list form. Instead, we want to concatenate the text into a large contiguous corpus. We'll use list comprehension to do this.
In real-world settings, we would use a special token - like <[END_OF_TEXT]> - instead of a newline character to separate texts so the model learns to distinguish between different textbooks.
# build corpus from dataset
text = "\\n".join([data['text'] for data in sample_data])
print("corpus length", len(text))
It might surprise those new to the space, but machine learning models cannot understand text as we see it. Instead, they understand text in numerical form, which leads us to the next step in preparing our data - tokenization.
For tokenization to work, we need an encoder and a decoder to convert text between character and numerical form. Crucially, we need a vocabulary for this to work. Beautifully, Python allows us to do this in a couple of lines of code.
# setup the vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create lookups for encoding and decoding
stoi = { ch:i for i,ch in enumerate(chars)}
itos = { i:ch for i,ch in enumerate(chars)}
# tokenizers
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join(itos[i] for i in l)
print(encode("hello")) # [49, 46, 53, 53, 56]
print(decode(encode("hello"))) # hello
As always, that time in the machine learning lifecycle has come where we split our dataset into training and validation splits. We'll hold out the validation split for evaluating our model after training.
And lastly, we'll decode a sample to ensure we can transform our tokens back into text.
import torch
# train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # we'll begin with a 90% split
train_data = data[:n]
val_data = data[n:]
print(len(train_data)) # 64079
print(len(val_data)) # 7120
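To close the loop on that round-trip check, here's a small sketch that decodes the first handful of training tokens back into text:
# decode a small slice of the training data back into characters
sample_tokens = train_data[:100].tolist()
print(decode(sample_tokens))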
Great stuff. That concludes data gathering and simple data preprocessing. State-of-the-art implementations go far beyond the techniques we've used, but this should do us just fine. We're now ready to build our GPT model from scratch.