Neural Text Generation with a Custom GPT
Author: Ivan Bongiorni - 2023-02-05
Open this tutorial on Google Colaboratory.
In this tutorial I will implement a full GPT (Generative Pretrained Transformer). The model will be trained, character by character, on the complete works of Shakespeare, and will therefore learn character-level embedding representations. This sidesteps the problem of OOV (out-of-vocabulary) tokens.
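As a quick aside, here is a toy sketch (not part of the tutorial code; the tiny vocabularies are invented purely for illustration) of why character-level modelling avoids OOV tokens: a word unseen at training time breaks a word-level vocabulary, but it remains representable character by character as long as its characters are known.
# Toy illustration with invented vocabularies (not used anywhere else in this tutorial)
word_vocab = {"to", "be", "or", "not"}      # hypothetical word-level vocabulary
char_vocab = set("".join(word_vocab))        # the characters appearing in it

print("robe" in word_vocab)                  # False: out-of-vocabulary at the word level
print(set("robe").issubset(char_vocab))      # True: every character is known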
import os
import requests
import time
import re
import numpy as np
import tensorflow as tf
from matplotlib import pyplot as plt
from tqdm import tqdm
Set length of text inputs for the model:
INPUT_LENGTH = 128
Download the text dataset containing all Shakespeare’s works:
url = 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt'
page = requests.get(url)
text = page.text
Let’s take a look at our corpus:
print(text[:147])
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
You are all resolved rather to die than to famish?
Let’s vectorize text, mapping every character into an integer to be fed into the Network. This can be done with a character-to-index dictionary:
# Store list of unique characters
unique_chars = list(set(text))
unique_chars.sort()
# Map every letter in our alphabet to an int
char2idx = { char: idx for idx, char in enumerate(unique_chars) }
# Produce a reverse dictionary to go back from int to str later
idx2char = { v: k for k, v in char2idx.items() }
# Visualize length of our alphabet
print(len(char2idx))
65
print(char2idx)
{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
At this point, we are ready to vectorize the whole corpus:
def numerical_encoding(text, char_dict):
    """
    First breaks text into a list of chars, then converts each to
    its numerical idx (np.array)
    """
    chars_list = [ char for char in text ]
    chars_list = [ char_dict[char] for char in chars_list ]
    chars_list = np.array(chars_list)
    return chars_list
encoded_text = numerical_encoding(text, char2idx)
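As a quick sanity check (this snippet is mine, not part of the original flow), we can decode the first few integers back to characters and make sure they reproduce the raw text:
# Sanity check: the decoded integers should match the start of the corpus
print(encoded_text[:20])
print(''.join(idx2char[i] for i in encoded_text[:20]))   # should equal text[:20]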
The next step is to process the vectorized text, creating sequences of length INPUT_LENGTH.
Given that a GPT is an autoregressive model developed for next-token prediction, we want to produce an input and a target sequence, where the target sequence corresponds to its input sequence shifted forward by one step.
As an example, from the series of tokens:
A, B, C, D, E, F, G, H, I
Assuming an input length of size 4, we’d want to obtain:
Input sequence: Target sequence:
A, B, C, D B, C, D, E
B, C, D, E C, D, E, F
C, D, E, F D, E, F, G
D, E, F, G E, F, G, H
E, F, G, H F, G, H, I
def get_text_matrix(sequence, len_input):
    """
    This generates a matrix containing all the sequences
    of length len_input to be fed into the Network
    """
    # create empty matrix
    X = np.empty((len(sequence)-len_input, len_input))

    # fill each row/time window from input sequence
    for i in range(X.shape[0]):
        X[i,:] = sequence[i : i+len_input]

    return X
X = get_text_matrix(encoded_text, INPUT_LENGTH+1)
print(X.shape)
(1115265, 129)
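Each row is INPUT_LENGTH+1 = 129 characters long, so it contains both an input sequence and its one-step-shifted target. As an optional check (added here for illustration, not in the original flow), decoding the first row of X should give back the first 129 characters of the corpus:
# Decode the first row of X back to text to verify the windowing
row = X[0].astype(int)   # X is stored as floats by np.empty
print(''.join(idx2char[i] for i in row))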
Model Implementation
First, I will specify all the relevant hyperparameters and import the layer and model classes needed from tensorflow and maximal.
VOCAB_SIZE = len(char2idx)
BATCH_SIZE = 64
N_EPOCHS = 3
LEARNING_RATE = 10e-5
N_LAYERS = 4
DEPTH = 256
HEADS = 4
FF_NODES = 256
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Dense
from maximal.layers import PositionalEmbedding, GPTLayer
A Neural Network is a computational graph. I will start by specifying its main elements.
A GPT doesn't use a traditional Embedding() layer; it requires a PositionalEmbedding() from maximal. The representation it generates is then fed into a stack of GPTLayer's.
Finally, a simple Dense() layer will “guess”, for each step of the sequence, what the next character is, in the form of a probability distribution over the alphabet.
NB: Even though probability distributions are normally learned and produced via softmax gates, choosing sparse_categorical_crossentropy() as the objective function with the argument from_logits=True will take care of that, applying the softmax under the hood.
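To make this concrete, here is a small standalone check (illustrative only, with made-up logits) showing that the loss computed on raw logits with from_logits=True matches the loss computed on explicit softmax probabilities:
# Illustration: sparse CCE on logits (from_logits=True) vs. on softmax probabilities
toy_logits = tf.constant([[2.0, 0.5, -1.0, 0.1]])
toy_label = tf.constant([0])
loss_from_logits = tf.keras.losses.sparse_categorical_crossentropy(toy_label, toy_logits, from_logits=True)
loss_from_probs = tf.keras.losses.sparse_categorical_crossentropy(toy_label, tf.nn.softmax(toy_logits), from_logits=False)
print(loss_from_logits.numpy(), loss_from_probs.numpy())   # numerically identical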
# Input layer
input_batch = Input(shape=(INPUT_LENGTH,), dtype=tf.int32)
# Positional Embedding
embedding = PositionalEmbedding(INPUT_LENGTH, VOCAB_SIZE, DEPTH)
# List of GPT Layers
gpt_layers = [ GPTLayer(depth=DEPTH, heads=HEADS, ff_nodes=FF_NODES) for _ in range(N_LAYERS) ]
# Output layer
classification_layer = Dense(VOCAB_SIZE)
Now we can build the computational graph by connecting all its elements together:
x = embedding(input_batch)

for layer in gpt_layers:
    x = layer(x)

classification = classification_layer(x)

gpt = Model(
    inputs = input_batch,
    outputs = classification
)
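Before training, an optional sanity check (not in the original flow) is to run one dummy batch through the graph: assuming the GPTLayer's preserve the sequence length, the output should contain one logit vector over the alphabet per input position, and gpt.summary() reports the parameter count.
# Optional shape check: (batch, INPUT_LENGTH) token ids -> (batch, INPUT_LENGTH, VOCAB_SIZE) logits
dummy_batch = np.zeros((1, INPUT_LENGTH), dtype=np.int32)
print(gpt(dummy_batch).shape)
gpt.summary()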
Training with Custom Loops
In this tutorial, the GPT model is trained with a custom training loop. The usual Keras approach would be something such as:
gpt.compile("adam", "sparse_categorical_crossentropy")
history = gpt.fit(X, Y, epochs=N_EPOCHS)
but I will build custom training loops instead, to understand and have full control of the process.
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
I will wrap the training step into a function decorated with @tf.function. This will compile all steps into a single TensorFlow op, making it approximately an order of magnitude faster than plain Python.
@tf.function
def train_on_batch(x, y):
    with tf.GradientTape() as tape:
        batch_loss = tf.reduce_sum(
            tf.keras.losses.sparse_categorical_crossentropy(
                y, gpt(x),
                from_logits=True)
        )
    gradients = tape.gradient(batch_loss, gpt.trainable_variables)
    optimizer.apply_gradients(zip(gradients, gpt.trainable_variables))
    return batch_loss
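The speed-up of @tf.function can also be observed directly (optional snippet; note it performs two real weight updates): the first call is slow because TensorFlow traces and compiles the graph, while subsequent calls with the same input shapes reuse the compiled op.
# Optional: the first call traces (compiles) the graph, the second reuses it
x0, y0 = X[:BATCH_SIZE, :-1], X[:BATCH_SIZE, 1:]
t0 = time.time(); train_on_batch(x0, y0); print('first call: ', round(time.time()-t0, 3), 's')
t0 = time.time(); train_on_batch(x0, y0); print('second call:', round(time.time()-t0, 3), 's')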
All is ready for training at this point. The main steps of the process are now:
- At each epoch, reshuffle the dataset to vary the composition of mini-batches.
- For each iteration, extract a slice of the dataset of size BATCH_SIZE, and split it into input and target arrays (x: chars [0:128], y: chars [1:129]).
- Run train_on_batch() on the input and target arrays.
- Periodically print the Loss and store its value in loss_history.
loss_history = []
for epoch in range(N_EPOCHS):
    start = time.time()

    # Reshuffle data at each epoch to randomize mini-batch composition
    reshuffle = np.random.choice(X.shape[0], X.shape[0], replace=False)
    X = X[reshuffle]

    for iteration in range(X.shape[0] // BATCH_SIZE):
        # take new minibatch (with 1 char shift from x to y)
        take = iteration * BATCH_SIZE
        x = X[ take:take+BATCH_SIZE , :-1 ]  # chars [0:128]
        y = X[ take:take+BATCH_SIZE , 1: ]   # chars [1:129]

        # training step
        current_loss = train_on_batch(x, y)

        # periodically store batch loss into history
        if iteration % 100 == 0:
            loss_history.append(current_loss)
            print(f"\t{iteration}\tLoss: {current_loss}")

    print("{}. \t Loss: {} \t Time: {}s".format(
        epoch+1, current_loss.numpy(), round(time.time()-start, 2)))
# Visualize Loss history
plt.figure(figsize=(15,7))
plt.plot(loss_history)
plt.title('Loss History')
plt.xlabel('Iterations')
plt.ylabel('Loss (Sparse CCE)')
plt.show()
Inference
At this point, the model is ready to generate new text. A specific function is needed for that, with the following arguments:
- A text prompt to start the generation.
- n, the number of tokens to be generated.
- A temperature parameter, governing the amount of noise in sampling the next token.
- A parameter k that restricts sampling only to the top-k most likely tokens.

def generate_text(prompt, n=1000, temperature=1.0, k=10):
    """
    Inference time for the GPT.

    Args:
        prompt (str): input text
        n (int): number of tokens to be generated
        temperature (float): noise in the output probability
            (>1. = noisy sampling; <1. = conservative sampling)
        k (int): restricts sampling to the top-k most likely tokens
    """
    # If prompt is shorter than INPUT_LENGTH raise error (no padding in this simple tutorial)
    assert len(prompt) >= INPUT_LENGTH, f"Prompt must be at least {INPUT_LENGTH} characters long"

    # If prompt is longer than INPUT_LENGTH crop it to its last INPUT_LENGTH characters
    if len(prompt) > INPUT_LENGTH:
        prompt = prompt[-INPUT_LENGTH:]

    generated_text = []

    for i in tqdm(range(n)):
        # vectorize prompt and adjust np.array shape
        vectorized_text = [char2idx[c] for c in prompt]
        vectorized_text = np.array(vectorized_text).reshape((1, len(vectorized_text)))

        # next token prediction (keep only the logits of the last position)
        pred = gpt.predict(vectorized_text, verbose=0)
        pred = np.squeeze(pred[:, -1, :])

        # temperature scaling
        pred /= temperature

        # restrict sampling to the top k tokens
        probs, indices = tf.math.top_k(pred, k, sorted=True)

        # sample the next token id
        probs = tf.nn.softmax(probs).numpy()
        pred_id = np.random.choice(indices, p=probs)

        # update the prompt with the generated character
        next_char = idx2char[pred_id]
        prompt = prompt[1:] + next_char
        generated_text.append(next_char)

    generated_text = ''.join(generated_text)
    return generated_text
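Before using it on real text, here is a tiny standalone illustration (with made-up logits, not connected to the model) of what temperature and k do: dividing the logits by a small temperature sharpens the distribution towards the most likely token, a large temperature flattens it, and top-k restricts which tokens can be drawn at all.
# Toy logits over 5 hypothetical tokens
toy_logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
for t in (0.2, 1.0, 2.0):
    print(f"temperature={t}:", tf.nn.softmax(toy_logits / t).numpy().round(3))

# top-k keeps only the k largest logits before sampling (here k=3)
top_probs, top_indices = tf.math.top_k(toy_logits, k=3, sorted=True)
print(top_indices.numpy(), tf.nn.softmax(top_probs).numpy().round(3))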
I will now feed a piece of Hamlet’s famous monologue to the GPT and ask it to continue from it:
prompt = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune"""
generation = generate_text(prompt=prompt, n=500, temperature=0.2)
print(generation)
# Finally, save model weights to disk
gpt.save('gpt_shakespeare_13.h5')
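Note that reloading the saved model later requires passing the custom maximal layers as custom_objects; something along these lines should work, assuming the maximal layers are serializable (reloading is not covered in this tutorial):
# Hedged sketch: reload the saved model, assuming PositionalEmbedding and GPTLayer support serialization
from tensorflow.keras.models import load_model
from maximal.layers import PositionalEmbedding, GPTLayer

reloaded_gpt = load_model(
    'gpt_shakespeare_13.h5',
    custom_objects={'PositionalEmbedding': PositionalEmbedding, 'GPTLayer': GPTLayer}
)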