GPT-4 is a large multimodal language model built by OpenAI. It can read and generate natural language, code, and math, and even understand images, all from a single prompt box. **GPT-4 also writes poems, debugs code, passes law school exams, explains quantum physics, and critiques Shakespeare, all with flair.**
GPT-4 is trained in three broad steps: large-scale pretraining on text to predict the next token, supervised fine-tuning, and reinforcement learning from human feedback (RLHF).
GPT-4 is built on three key ideas:
- Tokens: bricks of language that make text math-friendly for the model
- Self-attention: the model looks at the entire range of words, so each token can gather context from every other token
- Transformer: the architecture that stacks these pieces, letting the model scale with data and be fine-tuned

Embedding – First, the input is split into tokens, i.e., pieces present in the model's vocabulary. Each token is associated with a vector of weights, and the embedding matrix has shape [vocabulary size x embedding dimension].

Attention – Here, interactions between the tokens take place. Three new vectors are formed for each token:
- Query (Q) – Asks questions about a token to understand it.
- Key (K) – Provides the information relevant to the query.
- Value (V) – Contains the actual content or meaning.
Each word checks how closely it relates to the other words by dotting its query with every key. The scores are divided by the square root of the key dimension and turned into probabilities with softmax. These probabilities decide how much of each word's value to borrow, creating a new, more informed version of the word.
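To make the embedding and Q/K/V mechanics concrete, here is a minimal single-head attention sketch in PyTorch. The sizes are toy values chosen for illustration (GPT-4's real dimensions are not public), and the causal mask a GPT-style decoder would add is omitted for brevity.

```python
import torch
import torch.nn.functional as F

# Toy sizes, purely illustrative.
vocab_size, embed_dim, seq_len = 1000, 64, 8

# Embedding: each token id is looked up in a [vocab_size x embed_dim] matrix.
embedding = torch.nn.Embedding(vocab_size, embed_dim)
token_ids = torch.randint(0, vocab_size, (seq_len,))
x = embedding(token_ids)                      # [seq_len, embed_dim]

# Three learned projections produce the Query, Key and Value vectors.
W_q = torch.nn.Linear(embed_dim, embed_dim, bias=False)
W_k = torch.nn.Linear(embed_dim, embed_dim, bias=False)
W_v = torch.nn.Linear(embed_dim, embed_dim, bias=False)
Q, K, V = W_q(x), W_k(x), W_v(x)

# Each query is dotted with every key, scaled, and turned into probabilities.
scores = Q @ K.T / (embed_dim ** 0.5)         # [seq_len, seq_len]
weights = F.softmax(scores, dim=-1)           # how much each token "borrows" from the others

# The probabilities mix the value vectors into context-aware token representations.
out = weights @ V                             # [seq_len, embed_dim]
print(out.shape)                              # torch.Size([8, 64])
```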
Multi-Layer Perceptron (MLP) – No token interaction occurs here. Instead, operations are applied individually to each token vector. While the attention block helps each token understand its context, the MLP block helps process and refine this information by exploring deeper, more complex patterns.
An activation function (such as GELU or ReLU) is applied to introduce non-linearity, enabling the model to learn complex patterns.
The output is computed as: output = W2 · activation(W1 · x + b1) + b2
Here W1 first expands the token vector to a wider hidden layer (commonly about four times the embedding dimension), the activation is applied, and W2 then reduces the vector back to its original size.
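A minimal sketch of that MLP block in PyTorch, assuming the common 4x expansion factor (GPT-4's exact hidden sizes have not been published):

```python
import torch
import torch.nn as nn

embed_dim = 64                                # toy size for illustration

# Position-wise MLP: applied to every token vector independently, no token interaction.
mlp = nn.Sequential(
    nn.Linear(embed_dim, 4 * embed_dim),      # W1: expand to a wider hidden layer
    nn.GELU(),                                # non-linearity so complex patterns can be learned
    nn.Linear(4 * embed_dim, embed_dim),      # W2: project back to the original size
)

x = torch.randn(8, embed_dim)                 # 8 token vectors
print(mlp(x).shape)                           # torch.Size([8, 64]) – same size as the input
```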
I would scrape data off the internet and add vetted books, code, academic papers, and other curated sources to train the model on.
Next, the data would be filtered: spam, copyrighted snippets, and duplicates would be removed. GPT-4's tokenizer would then compress words, code, and emoji into a single shared vocabulary of tokens.
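A small sketch of both steps: hashing for exact-duplicate removal (a toy stand-in, not OpenAI's actual pipeline) and tiktoken, OpenAI's open-source tokenizer library, with the cl100k_base encoding that GPT-4 uses.

```python
import hashlib
import tiktoken   # pip install tiktoken

documents = [
    "The cat sat on the mat. 🐱",
    "def add(a, b): return a + b",
    "The cat sat on the mat. 🐱",    # exact duplicate, should be dropped
]

# Toy filtering: drop exact duplicates by hashing each document.
seen, filtered = set(), []
for doc in documents:
    digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
    if digest not in seen:
        seen.add(digest)
        filtered.append(doc)

# cl100k_base maps text, code and emoji alike into integer token ids.
enc = tiktoken.get_encoding("cl100k_base")
for doc in filtered:
    ids = enc.encode(doc)
    print(len(ids), ids[:8])
```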
GPT-4 would consist of hundreds of stacked layers and close to a trillion connections (parameters). Roughly speaking, the more layers, the more capable the model, and a wider layer holds more neurons working in parallel, so the model's skeleton must be both **wider** and **deeper**.
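As a toy illustration of "deeper" (more stacked blocks) and "wider" (bigger vectors per layer), here is how those two knobs appear with PyTorch's built-in transformer blocks; the numbers are placeholders, since GPT-4's real configuration has not been published.

```python
import torch.nn as nn

d_model = 512          # "width": size of each token vector
n_layers = 24          # "depth": how many transformer blocks are stacked

block = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=8,                        # attention heads per block
    dim_feedforward=4 * d_model,    # width of the MLP inside each block
    batch_first=True,
)
model = nn.TransformerEncoder(block, num_layers=n_layers)

print(sum(p.numel() for p in model.parameters()))   # rough parameter count
```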
Instead of using the model's entire "brain" for every token, only the most relevant parts (experts) are activated, a design known as a mixture of experts. This helps **save energy**, **speed up processing** and **make the model much more efficient**.
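Here is a minimal sketch of that idea: a top-k "router" sends each token through only a couple of expert MLPs instead of all of them. Whether and exactly how GPT-4 does this has not been disclosed, so treat this purely as an illustration of sparse activation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, n_experts, top_k = 64, 8, 2   # toy sizes

# Each expert is a small MLP; only a few run per token.
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(embed_dim, 4 * embed_dim), nn.GELU(),
                  nn.Linear(4 * embed_dim, embed_dim))
    for _ in range(n_experts)
])
router = nn.Linear(embed_dim, n_experts)  # scores how relevant each expert is to a token

def moe_forward(x):                                        # x: [n_tokens, embed_dim]
    gate_scores = router(x)                                # [n_tokens, n_experts]
    weights, chosen = gate_scores.topk(top_k, dim=-1)      # keep only the top-k experts per token
    weights = F.softmax(weights, dim=-1)
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(n_experts):
            mask = chosen[:, slot] == e                    # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot:slot + 1] * experts[e](x[mask])
    return out

tokens = torch.randn(16, embed_dim)
print(moe_forward(tokens).shape)                           # torch.Size([16, 64])
```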
Coming to the hardware, GPT-4 would run on tens of thousands of powerful GPUs or TPUs connected by ultra-fast interconnects (on the order of 400 Gb/s).
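To give a flavour of how training is spread across many GPUs, here is a minimal data-parallel sketch using PyTorch's DistributedDataParallel. Real GPT-4-scale training would additionally shard the model itself (tensor and pipeline parallelism), which is beyond a few lines.

```python
import os
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(1024, 1024).cuda(local_rank)   # stand-in for the real network
model = DDP(model, device_ids=[local_rank])      # gradients are averaged across GPUs each step
```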
It would use the Adam optimizer, which adapts the step size for each weight so training converges faster and avoids getting stuck, and the learning rate would be reduced over time.
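A minimal sketch of that setup in PyTorch, with AdamW and a cosine decay schedule as illustrative choices (OpenAI has not published GPT-4's actual optimizer settings):

```python
import torch

model = torch.nn.Linear(1024, 1024)          # stand-in for the full network

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000)

for step in range(1_000):
    loss = model(torch.randn(8, 1024)).pow(2).mean()   # dummy loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                          # learning rate decays over the run
```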
To save memory, gradient checkpointing and mixed precision would be used. With gradient checkpointing, instead of storing every intermediate activation from the forward pass, only a few "checkpoints" are kept and the rest are recomputed during the backward pass, trading a little extra compute for a large memory saving.
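A small sketch using torch.utils.checkpoint, which recomputes the wrapped block's activations during the backward pass instead of storing them:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(32, 1024, requires_grad=True)

# Activations inside `block` are not kept; they are recomputed in backward().
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```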
With mixed precision, smaller, lower-precision numbers (such as 16-bit floats) are used for most computations to speed things up and reduce memory use, while full-precision values are kept where needed so training stays stable.
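A minimal mixed-precision training step with PyTorch's automatic mixed precision (autocast plus a gradient scaler); it assumes a CUDA GPU and is an illustration, not GPT-4's actual recipe.

```python
import torch

device = "cuda"                                    # mixed precision needs a GPU
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()               # rescales the loss so tiny fp16 gradients don't vanish

x = torch.randn(32, 1024, device=device)
with torch.cuda.amp.autocast():                    # run the forward pass in 16-bit where it is safe
    loss = model(x).pow(2).mean()

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```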