
BLOG · 3/6/2025

LLMs, Transformers, and How to Build GPT

A high-level walkthrough of LLMs, transformers, and GPT

Anagha Raghavan

LLM – Large Language Models

They are designed to perform a wide variety of natural language processing tasks, including translation, converting text to speech or images, and many more.

GPT (Generative Pre-trained Transformer) is one such LLM. To understand how it is built, we first need to understand transformers and how they work.

Transformer

It is a type of neural network that uses self-attention mechanisms to model relationships between tokens, regardless of their positions. Let’s get into the details of how it works:

Embedding – First, the input is split into tokens: pieces of text present in the model’s vocabulary. Each token is mapped to a learned vector of weights. The embedding matrix is of shape

[vocabulary size × embedding dimension]
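To make this concrete, here is a minimal NumPy sketch of the embedding step. The tiny vocabulary, token IDs, and dimensions below are made up for illustration, and the weights are random; in a real model the embedding matrix is learned during training.

```python
import numpy as np

# Hypothetical toy setup: a tiny vocabulary and a small embedding dimension.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
vocab_size = len(vocab)      # 5
embedding_dim = 8            # real models use hundreds or thousands

# The embedding matrix has shape [vocabulary size x embedding dimension].
# In a real model these weights are learned; here they are random.
embedding_matrix = np.random.randn(vocab_size, embedding_dim)

# Split the input into tokens, look each one up in the vocabulary,
# then map each token ID to its row of the embedding matrix.
tokens = ["the", "cat", "sat"]
token_ids = [vocab[t] for t in tokens]
token_vectors = embedding_matrix[token_ids]

print(token_vectors.shape)   # (3, 8) -> one vector per token
```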

Attention – Here, interactions between the tokens take place. Every word attends to every other word to understand the context.

To achieve this, three new vectors are computed for each token:

Query (Q) – Asks questions about a token to understand it.

Key (K) – Provides the information relevant to the query.

Value (V) – Contains the actual content or meaning.

The dot product of a Query and a Key gives a score (usually scaled by the square root of the key dimension to keep it stable). The higher the score, the more attention is paid to that token. Applying the softmax function to the scores converts them into attention weights.

For each token, a weighted average of the Value vectors (V) is computed using these attention weights. This result is added to the original embedding, forming an updated vector that captures the contextual meaning of the token.
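Here is a minimal NumPy sketch of a single attention head following the steps above. The sizes and weight matrices are hypothetical (random rather than learned), but the computation mirrors the description: project into Q, K, and V, score with a scaled dot product, softmax into attention weights, take a weighted average of the Values, and add the result back to the original embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: 3 tokens, embedding dimension 8.
seq_len, d_model = 3, 8
x = np.random.randn(seq_len, d_model)         # token embeddings

# Learned projection matrices (random here for illustration).
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q = x @ W_q   # what each token is asking about
K = x @ W_k   # what each token offers in answer to queries
V = x @ W_v   # the content each token carries

# Dot-product scores, scaled by the square root of the key dimension;
# softmax turns each row of scores into attention weights that sum to 1.
scores = Q @ K.T / np.sqrt(d_model)           # shape: [3, 3]
weights = softmax(scores, axis=-1)

# Weighted average of the Value vectors, added back to the input
# (the residual connection) to give context-aware token vectors.
attended = weights @ V
updated = x + attended                        # shape: [3, 8]
```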

This process runs several times in parallel through multi-head attention, where each "head" captures different aspects of the relationships between tokens.
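A rough sketch of the multi-head idea, again with made-up sizes and random weights: several smaller heads attend in parallel over slices of the embedding, and their outputs are concatenated and mixed by a final projection.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, n_heads = 3, 8, 2
d_head = d_model // n_heads                   # each head works in a smaller space
x = np.random.randn(seq_len, d_model)

head_outputs = []
for _ in range(n_heads):
    # Each head has its own (randomly initialised) Q, K, V projections.
    W_q = np.random.randn(d_model, d_head)
    W_k = np.random.randn(d_model, d_head)
    W_v = np.random.randn(d_model, d_head)
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
    head_outputs.append(weights @ V)          # shape: [3, d_head]

# Concatenate the heads and apply an output projection to mix them.
W_o = np.random.randn(d_model, d_model)
multi_head = np.concatenate(head_outputs, axis=-1) @ W_o   # shape: [3, 8]
```

In practice the heads are computed as one batched tensor operation rather than a Python loop; the loop here is only for readability.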

Multi-Layer Perceptron (MLP) – No token interaction occurs here. Instead, operations are applied individually to each token vector. While the attention block helps each token understand its context, the MLP block helps process and refine this information by exploring deeper, more complex patterns.

This is done as follows:

First, the token vector is expanded to a higher-dimensional space.

An activation function (such as GELU or ReLU) is applied to introduce non-linearity, enabling the model to learn complex patterns.

The output is computed as:

Output = W₂ ⋅ GELU(W₁ ⋅ x + b₁) + b₂

After this, the vector is projected back to its original size (the W₂ step in the formula above); a minimal sketch of this block follows below.
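Here is a minimal sketch of that feed-forward block, assuming a 4× expansion (a common choice) and random weights, and using the widely used tanh approximation of GELU.

```python
import numpy as np

def gelu(x):
    # Common tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

d_model = 8
d_hidden = 4 * d_model      # expand to a higher-dimensional space (4x is typical)

# Learned weights and biases (random here for illustration).
W1 = np.random.randn(d_model, d_hidden)
b1 = np.zeros(d_hidden)
W2 = np.random.randn(d_hidden, d_model)
b2 = np.zeros(d_model)

def mlp(x):
    # Same computation as Output = W2 . GELU(W1 . x + b1) + b2,
    # written for row vectors and applied to each token independently.
    return gelu(x @ W1 + b1) @ W2 + b2

tokens = np.random.randn(3, d_model)   # 3 token vectors from the attention block
out = mlp(tokens)                      # shape: [3, 8] -> back to the original size
```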

Training the Model

The model is trained on large amounts of text data. It uses backpropagation to adjust its weights and biases based on the error in its predictions. This process is repeated billions of times across trillions of words to optimize the model’s performance.
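As a heavily simplified illustration of the idea (nothing like GPT-scale training), here is one gradient-descent step on a toy next-token predictor: a single weight matrix maps the current token to scores over possible next tokens, cross-entropy measures the error, and the backpropagation step, written out by hand for this one-layer case, nudges the weights to reduce it. All names and sizes below are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Toy setup: predict the next token ID directly from the current token ID
# with a single weight matrix (a "bigram" model).
vocab_size = 5
W = np.random.randn(vocab_size, vocab_size) * 0.1
learning_rate = 0.1

# Training pairs: (current token, correct next token).
inputs  = np.array([0, 1, 2, 3])
targets = np.array([1, 2, 3, 4])

# Forward pass: a probability for every possible next token.
logits = W[inputs]                       # shape: [4, vocab_size]
probs = softmax(logits)

# Cross-entropy loss: how wrong the predictions were.
loss = -np.log(probs[np.arange(len(targets)), targets]).mean()

# Backpropagation (derived by hand for this one-layer model):
# the gradient of softmax + cross-entropy is (probs - one_hot_targets).
grad_logits = probs.copy()
grad_logits[np.arange(len(targets)), targets] -= 1.0
grad_logits /= len(targets)

# Gradient-descent update: nudge the weights to reduce the error.
np.add.at(W, inputs, -learning_rate * grad_logits)
```

A real GPT repeats a step like this an enormous number of times, with the gradients flowing automatically back through every attention and MLP layer rather than being derived by hand.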

Fine-Tuning, Safety, and Deployment

Safety mechanisms are added to prevent harmful or biased responses. One common technique is Reinforcement Learning from Human Feedback (RLHF), where human reviewers rank model responses to guide the model in learning which outputs are more appropriate. Once the model is sufficiently aligned and tested, it is deployed for real-world use.
