LLMs for dummies — Understanding Large Language Models BERT & GPT

Avinash
4 min read · Feb 7, 2023

LLMs, true to their name, are large models trained on huge text corpora in a self-supervised manner so that they develop a general understanding of language. For any downstream task such as classification or text generation, the model can then be extended (with something as simple as a fully connected layer) and fine-tuned on a labelled dataset particular to that task.

Tokenisers

The first step in any NLP task is to convert the given text into tokens. LLMs use two types of tokenisers — BPE and WordPiece.

BPE

Initially, each character is an individual token. Contiguous pairs of tokens are then counted by their co-occurrence frequency, and the most frequent pair is merged to form a new token. The new token’s count is subtracted from the counts of its constituent tokens. These steps are repeated for a defined number of iterations or until the desired vocabulary size is reached. Each merge is recorded as a rule, which is replayed to tokenise text during inference.
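
A minimal sketch of BPE training in plain Python (not the exact tokeniser code used by GPT); the function names are illustrative, and “</w>” marks the end of a word:

```python
from collections import Counter

def most_frequent_pair(word_freqs):
    """Count how often each adjacent pair of tokens occurs across the corpus."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged token."""
    merged = {}
    for word, freq in word_freqs.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

def train_bpe(corpus, num_merges):
    # Start with each character as its own token.
    word_freqs = Counter(tuple(w) + ("</w>",) for w in corpus.split())
    merge_rules = []
    for _ in range(num_merges):
        pair = most_frequent_pair(word_freqs)
        if pair is None:
            break
        word_freqs = merge_pair(pair, word_freqs)
        merge_rules.append(pair)  # recorded and replayed at inference time
    return merge_rules

print(train_bpe("low lower lowest low low", num_merges=5))
```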

WordPiece

Instead of merging the most frequent co-occurring pair, WordPiece scores each pair by dividing its frequency by the product of its individual token frequencies, and merges the highest-scoring pair. And instead of replaying merge rules at inference, each word is tokenised greedily: the longest vocabulary token it starts with is split off, and the rest of the word is tokenised the same way.
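
Below is a sketch of that greedy longest-match step at inference time, assuming a toy vocabulary; real WordPiece vocabularies mark word-internal pieces with a “##” prefix, which the sketch follows:

```python
def wordpiece_tokenise(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Look for the longest vocabulary entry that matches from `start`.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:          # no match at all: the whole word becomes [UNK]
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenise("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenise("playing", vocab))    # ['play', '##ing']
```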

GPT-based models use BPE, and BERT-based models use WordPiece.

Architecture

Transformer Block

A multi-headed self-attention layer followed by a fully connected layer, with a layer norm and a residual skip connection around each. There is only one fully connected layer shared by all tokens, not one per token in the sequence; this allows transformer blocks to take input of any sequence length. The layer norm normalises across the token (embedding) dimension.
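
As a rough sketch, here is what such a post-norm block could look like in PyTorch; the dimensions are illustrative, and d_model is what this article calls the token dimension:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # The same two-layer MLP is applied to every position independently.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)  # normalises over the d_model dim
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):
        # x: (batch, seq_len, d_model); the output has the same shape,
        # so blocks can be stacked and accept any sequence length.
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + attn_out)        # residual + layer norm
        x = self.norm2(x + self.ff(x))      # residual + layer norm
        return x

block = TransformerBlock()
print(block(torch.randn(2, 10, 768)).shape)  # torch.Size([2, 10, 768])
```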

You can read about multi-headed and masked attention here

Encoder only (left) and Decoder only (right)

Encoder

The transformer block’s output shape is the same as its input shape, so blocks can be stacked on top of one another to extract high-level features without modifying hyper-parameters (similar to a convolution layer with same padding). Transformer blocks stacked together form an encoder.
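
Stacking therefore needs no shape bookkeeping; a short sketch building on the TransformerBlock from the previous snippet:

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_blocks=12, d_model=768, n_heads=12):
        super().__init__()
        self.blocks = nn.ModuleList(
            TransformerBlock(d_model, n_heads) for _ in range(n_blocks)
        )

    def forward(self, x):
        # Every block maps (batch, seq_len, d_model) -> the same shape,
        # so the blocks compose directly.
        for block in self.blocks:
            x = block(x)
        return x
```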

Decoder

Decoders are also transformer blocks stacked together, but they differ from encoders in how they are trained and used at inference.

Decoders are used in generative tasks such as QnA, text summarisation and prompt completion, and generation happens sequentially. At step t, given the t previously generated words as input, the decoder produces the (t+1)-th word. In the next step the (t+1)-th word is appended to the input, and the process continues until an End of Sentence token is generated. This is feasible because transformer blocks can act on sequences of any length.
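
A sketch of this generation loop, assuming a hypothetical `decoder` that maps token ids to next-token logits and an illustrative End of Sentence id:

```python
import torch

@torch.no_grad()
def generate(decoder, prompt_ids, eos_id, max_new_tokens=50):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        x = torch.tensor(tokens).unsqueeze(0)   # (1, t)
        logits = decoder(x)                     # (1, t, vocab_size), assumed shape
        next_id = int(logits[0, -1].argmax())   # greedy pick of the (t+1)-th word
        tokens.append(next_id)                  # append and feed back in
        if next_id == eos_id:
            break
    return tokens
```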

Training can be sped up by overcoming the sequential nature of the decoder using masked attention. Since masked attention only allows information to flow from the left of each word, the sequential text-generation task can be reframed as predicting the next word at every position, which can be trained for in parallel.
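
A sketch of this parallel training setup, again assuming a hypothetical `decoder` that accepts an attention mask and returns per-position logits:

```python
import torch
import torch.nn.functional as F

def causal_mask(seq_len):
    # True above the diagonal = positions that may not be attended to,
    # so each position only sees tokens to its left (and itself).
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

def next_word_loss(decoder, token_ids):
    # Shift by one: the target at position i is the token at position i + 1,
    # so all positions are predicted (and trained on) in a single forward pass.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = decoder(inputs, attn_mask=causal_mask(inputs.size(1)))
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
```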

Encoder Decoder together

Encoder Decoder

Here, the encoder generates a high-level representation of the input, which the decoder uses as a context vector when generating the output. Implementation-wise, the encoder output is used to compute the key and value vectors in the “encoder-decoder attention” blocks. The decoder still has its own self-attention blocks (which are of course masked).
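
A sketch of that encoder-decoder attention step with PyTorch’s MultiheadAttention, where the queries come from the decoder states and the keys/values from the encoder output; the shapes are illustrative:

```python
import torch
import torch.nn as nn

d_model, n_heads = 768, 12
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

encoder_out = torch.randn(2, 20, d_model)    # context from the encoder
decoder_states = torch.randn(2, 7, d_model)  # (masked) self-attended decoder tokens

# query = decoder states, key = value = encoder output
out, _ = cross_attn(decoder_states, encoder_out, encoder_out)
print(out.shape)  # torch.Size([2, 7, 768])
```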

BERT — Bidirectional Encoder Representations from Transformers

It is an encoder-only model. There are two versions that differ only in hyper-parameters, not architecture: base and large come with 12 and 24 transformer blocks, 768 and 1024 token dimensions, and 12 and 16 attention heads respectively.

It is trained on 2 tasks.

  1. Masked Language Model — 15% of the tokens in the input are picked at random and either replaced with a [MASK] token, replaced with a random word, or left unchanged. The network is then tasked with predicting these tokens, which forces it to use context from both the left and the right (see the sketch after this list).
  2. Next Sentence Prediction — Given two sentences, the network is trained to classify whether the second sentence belongs to the same context as the first, forcing the network to learn long-range context.
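
A sketch of the masked-language-model corruption step; the 80/10/10 split between [MASK], a random word and leaving the token unchanged follows the BERT paper, and the token ids here are purely illustrative:

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored position
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                     # the network must predict the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id             # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: replace with a random word
            # else: 10% keep the original token unchanged
    return inputs, labels
```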

GPT — Generative Pre-trained Transformer

It is a decoder-only model with 12 transformer blocks, a 768 token dimension and 12 attention heads. Unlike BERT, GPT was trained simultaneously on an unsupervised dataset for the generative task and on supervised datasets (albeit of smaller size) for tasks like classification, similarity and MCQ answering.

Later versions of GPT made no changes to the architecture and were simply bigger models trained on far larger datasets. GPT-2 is roughly a 10x bigger model with 48 transformer blocks and a 1600 token dimension. GPT-3 is roughly a 100x bigger model with 96 transformer blocks and a 12288 token dimension.

The huge size allowed these models to develop “zero shot task transfer”: given examples of a task followed by a question as a prompt, the model is able to solve it. For example, if the model was given the prompt “English Sentence 1: French Translation 1 :: English Sentence 2”, it would output French Translation 2.

Note that this is zero shot because the model was never trained or fine-tuned for the translation task, i.e. no gradient updates were made to the weights. It was only provided with examples in the input.
