Neural Text Generation with a Custom GPT
Author: Ivan Bongiorni - 2023-02-05
Open this tutorial on Google Colaboratory.
In this tutorial I will implement a full GPT (Generative Pretrained Transformer). The model will be trained, character by character, on the complete works of Shakespeare, and will therefore learn character-level embedding representations. This sidesteps the problem of OOV (out-of-vocabulary) tokens.
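As a quick aside, here is a toy sketch (not part of the tutorial code; the tiny vocabularies are invented purely for illustration) of why character-level modelling avoids OOV tokens: a word unseen at training time breaks a word-level vocabulary, but it remains representable character by character as long as its characters are known.
# Toy illustration with invented vocabularies (not used anywhere else in this tutorial)
word_vocab = {"to", "be", "or", "not"}      # hypothetical word-level vocabulary
char_vocab = set("".join(word_vocab))        # the characters appearing in it

print("robe" in word_vocab)                  # False: out-of-vocabulary at the word level
print(set("robe").issubset(char_vocab))      # True: every character is known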
import os
import requests
import time
import re
import numpy as np
import tensorflow as tf
from matplotlib import pyplot as plt
from tqdm import tqdm
Set length of text inputs for the model:
INPUT_LENGTH = 128
Download the text dataset containing all Shakespeare’s works:
url = 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt'
page = requests.get(url)
text = page.text
Let’s take a look at our corpus:
print(text[:147])
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
You are all resolved rather to die than to famish?
Let’s vectorize text, mapping every character into an integer to be fed into the Network. This can be done with a character-to-index dictionary:
# Store list of unique characters
unique_chars = list(set(text))
unique_chars.sort()
# Map every letter in our alphabet to an int
char2idx = { char: idx for idx, char in enumerate(unique_chars) }
# Produce a reverse dictionary to go back from int to str later
idx2char = { v: k for k, v in char2idx.items() }
# Visualize length of our alphabet
print(len(char2idx))
65
print(char2idx)
{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
At this point, we are ready to vectorize the whole corpus:
def numerical_encoding(text, char_dict):
    """
    First breaks text into a list of chars, then converts each to
    its numerical idx (np.array)
    """
    chars_list = [ char for char in text ]
    chars_list = [ char_dict[char] for char in chars_list ]
    chars_list = np.array(chars_list)
    return chars_list
encoded_text = numerical_encoding(text, char2idx)
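As a quick sanity check (this snippet is mine, not part of the original flow), we can decode the first few integers back to characters and make sure they reproduce the raw text:
# Sanity check: the decoded integers should match the start of the corpus
print(encoded_text[:20])
print(''.join(idx2char[i] for i in encoded_text[:20]))   # should equal text[:20]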
The next step is to process the vectorized text, creating sequences of length INPUT_LENGTH.
Given that a GPT is an autoregressive model developed for next-token prediction, we want to produce an input and a target sequence, where the target sequence corresponds to its input sequence shifted forward by one step.
As an example, from the series of tokens:
A, B, C, D, E, F, G, H, I
Assuming an input length of size 4, we’d want to obtain:
Input sequence: Target sequence:
A, B, C, D B, C, D, E
B, C, D, E C, D, E, F
C, D, E, F D, E, F, G
D, E, F, G E, F, G, H
E, F, G, H F, G, H, I
def get_text_matrix(sequence, len_input):
    """
    This generates a matrix containing all the sequences
    of length len_input to be fed into the Network
    """
    # create empty matrix
    X = np.empty((len(sequence)-len_input, len_input))

    # fill each row/time window from input sequence
    for i in range(X.shape[0]):
        X[i,:] = sequence[i : i+len_input]

    return X
X = get_text_matrix(encoded_text, INPUT_LENGTH+1)
print(X.shape)
(1115265, 129)
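Each row is INPUT_LENGTH+1 = 129 characters long, so it contains both an input sequence and its one-step-shifted target. As an optional check (added here for illustration, not in the original flow), decoding the first row of X should give back the first 129 characters of the corpus:
# Decode the first row of X back to text to verify the windowing
row = X[0].astype(int)   # X is stored as floats by np.empty
print(''.join(idx2char[i] for i in row))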
Model Implementation
First, I will specify all the relevant hyperparameters and import the layer and model classes needed from tensorflow and maximal.
VOCAB_SIZE = len(char2idx)
BATCH_SIZE = 64
N_EPOCHS = 3
LEARNING_RATE = 10e-5
N_LAYERS = 4
DEPTH = 256
HEADS = 4
FF_NODES = 256
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Dense
from maximal.layers import PositionalEmbedding, GPTLayer
A Neural Network is a computational graph. I will start by specifying its main elements.
A GPT doesn't use a traditional Embedding() layer; it requires a PositionalEmbedding() from maximal. The representation it generates is then fed into a stack of GPTLayer's.
Finally, a simple Dense() layer will “guess”, for each step of the sequence, what the next character is, in the form of a probability distribution over the alphabet.
NB: Even though probability distributions are normally learned and produced via softmax gates, choosing sparse_categorical_crossentropy() as the objective function with the argument from_logits=True will take care of that, applying the softmax under the hood.
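To make this concrete, here is a small standalone check (illustrative only, with made-up logits) showing that the loss computed on raw logits with from_logits=True matches the loss computed on explicit softmax probabilities:
# Illustration: sparse CCE on logits (from_logits=True) vs. on softmax probabilities
toy_logits = tf.constant([[2.0, 0.5, -1.0, 0.1]])
toy_label = tf.constant([0])
loss_from_logits = tf.keras.losses.sparse_categorical_crossentropy(toy_label, toy_logits, from_logits=True)
loss_from_probs = tf.keras.losses.sparse_categorical_crossentropy(toy_label, tf.nn.softmax(toy_logits), from_logits=False)
print(loss_from_logits.numpy(), loss_from_probs.numpy())   # numerically identical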
# Input layer
input_batch = Input(shape=(INPUT_LENGTH,), dtype=tf.int32)
# Positional Embedding
embedding = PositionalEmbedding(INPUT_LENGTH, VOCAB_SIZE, DEPTH)
# List of GPT Layers
gpt_layers = [ GPTLayer(depth=DEPTH, heads=HEADS, ff_nodes=FF_NODES) for _ in range(N_LAYERS) ]
# Output layer
classification_layer = Dense(VOCAB_SIZE)
Now we can build the computational graph by connecting all its elements together:
x = embedding(input_batch)

for layer in gpt_layers:
    x = layer(x)

classification = classification_layer(x)

gpt = Model(
    inputs = input_batch,
    outputs = classification
)
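Before training, an optional sanity check (not in the original flow) is to run one dummy batch through the graph: assuming the GPTLayer's preserve the sequence length, the output should contain one logit vector over the alphabet per input position, and gpt.summary() reports the parameter count.
# Optional shape check: (batch, INPUT_LENGTH) token ids -> (batch, INPUT_LENGTH, VOCAB_SIZE) logits
dummy_batch = np.zeros((1, INPUT_LENGTH), dtype=np.int32)
print(gpt(dummy_batch).shape)
gpt.summary()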
Training with Custom Loops
In this tutorial, the GPT model is trained with a custom training loop. The usual Keras approach would be something such as:
gpt.compile("adam", "sparse_categorical_crossentropy")
history = gpt.fit(X, Y, epochs=N_EPOCHS)
but I will build custom training loops instead, to understand and have full control of the process.
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
I will wrap the training step into a function decorated with @tf.function. This will compile all steps into a single TensorFlow op, making it approximately an order of magnitude faster than plain Python.
@tf.function
def train_on_batch(x, y):
    with tf.GradientTape() as tape:
        batch_loss = tf.reduce_sum(
            tf.keras.losses.sparse_categorical_crossentropy(
                y, gpt(x),
                from_logits=True)
        )
    gradients = tape.gradient(batch_loss, gpt.trainable_variables)
    optimizer.apply_gradients(zip(gradients, gpt.trainable_variables))
    return batch_loss
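The speed-up of @tf.function can also be observed directly (optional snippet; note it performs two real weight updates): the first call is slow because TensorFlow traces and compiles the graph, while subsequent calls with the same input shapes reuse the compiled op.
# Optional: the first call traces (compiles) the graph, the second reuses it
x0, y0 = X[:BATCH_SIZE, :-1], X[:BATCH_SIZE, 1:]
t0 = time.time(); train_on_batch(x0, y0); print('first call: ', round(time.time()-t0, 3), 's')
t0 = time.time(); train_on_batch(x0, y0); print('second call:', round(time.time()-t0, 3), 's')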
All is ready for training at this point. The main steps of the process are now:
- At each epoch, reshuffle the dataset to vary the composition of mini-batches.
- For each iteration, extract a slice of the dataset of size BATCH_SIZE, and split it into input and target arrays (x: chars [0:128], y: chars [1:129]).
- Run train_on_batch() on the input and target arrays.
- Periodically print the Loss and store its value in loss_history.
loss_history = []
for epoch in range(N_EPOCHS):
    start = time.time()

    # Reshuffle data at each epoch to randomize mini-batch composition
    reshuffle = np.random.choice(X.shape[0], X.shape[0], replace=False)
    X = X[reshuffle]

    for iteration in range(X.shape[0] // BATCH_SIZE):
        # take new minibatch (with 1 char shift from x to y)
        take = iteration * BATCH_SIZE
        x = X[ take:take+BATCH_SIZE , :-1 ]  # chars [0:128]
        y = X[ take:take+BATCH_SIZE , 1: ]   # chars [1:129]

        # training step
        current_loss = train_on_batch(x, y)

        # periodically store batch loss into history
        if iteration % 100 == 0:
            loss_history.append(current_loss)
            print(f"\t{iteration}\tLoss: {current_loss}")

    print("{}. \t Loss: {} \t Time: {}s".format(
        epoch+1, current_loss.numpy(), round(time.time()-start, 2)))
# Visualize Loss history
plt.figure(figsize=(15,7))
plt.plot(loss_history)
plt.title('Loss History')
plt.xlabel('Iterations')
plt.ylabel('Loss (Sparse CCE)')
plt.show()
Inference
At this point, the model is ready to generate new text. A specific function is needed for that, with the following arguments:
- A text prompt to start the generation.
- n, the number of tokens to be generated.
- A temperature parameter, governing the amount of noise in sampling the next token.
- A parameter k that restricts sampling only to the top-k most likely tokens.

def generate_text(prompt, n=1000, temperature=1.0, k=10):
    """
    Inference time for the GPT.

    Args:
        prompt (str): input text
        n (int): number of tokens to be generated
        temperature (float): noise in the output probability
            (>1. = noisy sampling; <1. = conservative sampling)
        k (int): restricts sampling to the top-k most likely tokens
    """
    # If prompt is shorter than INPUT_LENGTH raise error (no padding in this simple tutorial)
    assert len(prompt) >= INPUT_LENGTH, f"Prompt must be at least {INPUT_LENGTH} characters long"

    # If prompt is longer than INPUT_LENGTH crop it to its last INPUT_LENGTH characters
    if len(prompt) > INPUT_LENGTH:
        prompt = prompt[-INPUT_LENGTH:]

    generated_text = []

    for i in tqdm(range(n)):
        # vectorize prompt and adjust np.array shape
        vectorized_text = [char2idx[c] for c in prompt]
        vectorized_text = np.array(vectorized_text).reshape((1, len(vectorized_text)))

        # next token prediction (keep only the logits of the last position)
        pred = gpt.predict(vectorized_text, verbose=0)
        pred = np.squeeze(pred[:, -1, :])

        # temperature scaling
        pred /= temperature

        # restrict sampling to the top k tokens
        probs, indices = tf.math.top_k(pred, k, sorted=True)

        # sample the next token id
        probs = tf.nn.softmax(probs).numpy()
        pred_id = np.random.choice(indices, p=probs)

        # update the prompt with the generated character
        next_char = idx2char[pred_id]
        prompt = prompt[1:] + next_char
        generated_text.append(next_char)

    generated_text = ''.join(generated_text)
    return generated_text
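Before using it on real text, here is a tiny standalone illustration (with made-up logits, not connected to the model) of what temperature and k do: dividing the logits by a small temperature sharpens the distribution towards the most likely token, a large temperature flattens it, and top-k restricts which tokens can be drawn at all.
# Toy logits over 5 hypothetical tokens
toy_logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
for t in (0.2, 1.0, 2.0):
    print(f"temperature={t}:", tf.nn.softmax(toy_logits / t).numpy().round(3))

# top-k keeps only the k largest logits before sampling (here k=3)
top_probs, top_indices = tf.math.top_k(toy_logits, k=3, sorted=True)
print(top_indices.numpy(), tf.nn.softmax(top_probs).numpy().round(3))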
I will now feed a piece of Hamlet’s famous monologue to the GPT and ask it to continue from it:
prompt = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune"""
generation = generate_text(prompt=prompt, n=500, temperature=0.2)
print(generation)
# Finally, save model weights to disk
gpt.save('gpt_shakespeare_13.h5')
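Note that reloading the saved model later requires passing the custom maximal layers as custom_objects; something along these lines should work, assuming the maximal layers are serializable (reloading is not covered in this tutorial):
# Hedged sketch: reload the saved model, assuming PositionalEmbedding and GPTLayer support serialization
from tensorflow.keras.models import load_model
from maximal.layers import PositionalEmbedding, GPTLayer

reloaded_gpt = load_model(
    'gpt_shakespeare_13.h5',
    custom_objects={'PositionalEmbedding': PositionalEmbedding, 'GPTLayer': GPTLayer}
)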