
BLOG · 6/6/2025

Linear Regression under the hood

vrishank aryan

Linear and Logistic Regression are the backbone of almost all classical ML models. Any model whose decision function is of the form:
$$Y=W^TX+B$$
(where $Y$ is the label, $X$ is the feature vector, $W$ is the weight vector, and $B$ is the bias, a scalar value) is called a Linear Model.

Let me quickly walk you through the key differences between Linear Regression and Logistic Regression.

| Linear Regression | Logistic Regression |
| --- | --- |
| Used to predict a continuous value (label) from the corresponding features. | Used to classify between two classes given the features; it outputs a probability that is thresholded into a boolean label. |
| Error is measured using Mean Square Error (MSE). | Error is measured using Binary Cross Entropy (i.e., minimizing the negative log-likelihood). |

Linear Regression

Here's how a computer solves a Linear Regression problem:

You begin with a table of features $X$ and labels $Y$. Suppose there are $M$ training examples and $N-1$ features.

| $X_1$ | $X_2$ | $X_3$ | $X_4$ | $Y$ |
| --- | --- | --- | --- | --- |
| 2 | 4 | 6 | 8 | 114 |
| 4 | 6 | 8 | 10 | 168 |
| 9 | 11 | 13 | 15 | 343 |

Here, $M = 3$ and $N-1 = 4$.

To include a bias, add an extra column filled with 1s to your feature matrix:

| $X_0$ | $X_1$ | $X_2$ | $X_3$ | $X_4$ | $Y$ |
| --- | --- | --- | --- | --- | --- |
| 1 | 2 | 4 | 6 | 8 | 114 |
| 1 | 4 | 6 | 8 | 10 | 168 |
| 1 | 9 | 11 | 13 | 15 | 343 |

Then initialize a weight vector $W$, often with zeros:

| $W_0$ (bias) | $W_1$ | $W_2$ | $W_3$ | $W_4$ |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 |

Feature and Label Matrices:

Features ($X$):

| $X_1$ | $X_2$ | $X_3$ | $X_4$ |
| --- | --- | --- | --- |
| 2 | 4 | 6 | 8 |
| 4 | 6 | 8 | 10 |
| 9 | 11 | 13 | 15 |

Labels ($Y$):

| $Y$ |
| --- |
| 114 |
| 168 |
| 343 |
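
Here is a minimal NumPy sketch of this setup. The variable names (`X`, `X_b`, `W`, `y`) are my own choices for illustration, not taken from the repo linked at the end:

```python
import numpy as np

# Feature matrix (M = 3 examples, N-1 = 4 features) and labels from the tables above
X = np.array([[2, 4, 6, 8],
              [4, 6, 8, 10],
              [9, 11, 13, 15]], dtype=float)
y = np.array([114.0, 168.0, 343.0])

# Prepend the X_0 = 1 column so the bias W_0 is learned like any other weight
X_b = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (3, 5)

# Initialize all weights (including the bias) to zero
W = np.zeros(X_b.shape[1])                       # shape (5,)
```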

Error Function: Mean Square Error

We use the Mean Square Error (MSE) as our loss function and Gradient Descent to optimize:

$$\text{MSE}=\frac{1}{m}\sum_{i=1}^{m}(Y_i-\hat{Y_i})^2$$

Where:

  • $Y_i$ is the actual label,
  • $\hat{Y_i}=h_\theta(X_i)$ is the predicted label,
  • $h_\theta(X_i)=X_i\cdot W$
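
As a quick sketch of these definitions, the prediction and the MSE take only a couple of lines (continuing with the `X_b`, `W`, `y` arrays from the setup snippet above):

```python
def predict(X_b, W):
    # h_theta(X) = X · W  -- one prediction per training example
    return X_b @ W

def mse(X_b, W, y):
    # (1/m) * sum of squared differences between labels and predictions
    residuals = y - predict(X_b, W)
    return np.mean(residuals ** 2)

print(mse(X_b, W, y))   # with W = 0 this is simply the mean of the squared labels
```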

Optimizer: Gradient Descent

Update rule:

$$\theta_j:=\theta_j-\alpha\cdot\frac{\partial J(\theta)}{\partial\theta_j}$$

For MSE, this becomes:

$$\theta_j:=\theta_j-\alpha\cdot\frac{2}{m}\sum_{i=1}^{m}(h_\theta(X_i)-Y_i)\cdot X_{ij}$$

(the factor $\frac{2}{m}$ comes from differentiating the squared error; in practice the constant is often absorbed into the learning rate $\alpha$). You repeat this update for every $\theta_j$ in the weight vector, simultaneously, until convergence.
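
Putting the update rule into code, a batch gradient descent loop might look like the sketch below (again reusing `X_b` and `y` from above; the learning rate and iteration count are arbitrary illustrative choices, not tuned values):

```python
alpha = 0.001        # learning rate (illustrative, not tuned)
n_iters = 10000
m = X_b.shape[0]

W = np.zeros(X_b.shape[1])

for _ in range(n_iters):
    errors = X_b @ W - y                      # h_theta(X_i) - Y_i for every example
    gradient = (2.0 / m) * (X_b.T @ errors)   # one partial derivative per theta_j
    W = W - alpha * gradient                  # simultaneous update of all weights

print(W)   # learned weight vector [W_0, W_1, W_2, W_3, W_4]
```

The exact weights you end up with depend on the learning rate and the number of iterations, so don't expect them to match the table below digit for digit.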


Visualizing the Loss Surface

In the case of two weights $\theta_1$, $\theta_2$, the loss surface $J(\theta)$ is a convex, bowl-shaped surface with a single global minimum. This is why gradient descent works so effectively here: following the negative gradient always moves you towards that minimum.

Contour Plot:

![Gradient descent steps on the MSE contour plot](path_to_image.png)
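
If you want to generate a contour plot like this yourself, one possible sketch is to fix all weights except two (here $W_1$ and $W_2$, an arbitrary choice) and evaluate the MSE on a grid. This continues from the `X_b` and `y` arrays defined earlier and uses matplotlib:

```python
import matplotlib.pyplot as plt

# Sweep W_1 and W_2 over a grid, holding the other weights at zero,
# and evaluate the MSE at every grid point
w1_vals = np.linspace(-20, 40, 200)
w2_vals = np.linspace(-20, 40, 200)
W1, W2 = np.meshgrid(w1_vals, w2_vals)

J = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_grid = np.array([0.0, W1[i, j], W2[i, j], 0.0, 0.0])
        J[i, j] = np.mean((y - X_b @ W_grid) ** 2)

plt.contour(W1, W2, J, levels=30)
plt.xlabel("$W_1$")
plt.ylabel("$W_2$")
plt.title("MSE contours over two weights")
plt.show()
```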


Final Weights (after optimization)

| $W_0$ (bias) | $W_1$ | $W_2$ | $W_3$ | $W_4$ |
| --- | --- | --- | --- | --- |
| 10 | 5 | 7 | 2 | 8 |

Verifying Predictions

First training example:

| $X_0$ | $X_1$ | $X_2$ | $X_3$ | $X_4$ |
| --- | --- | --- | --- | --- |
| 1 | 2 | 4 | 6 | 8 |

Predicted:

$$\hat{Y}=X\cdot W=[1\ 2\ 4\ 6\ 8]\cdot[10\ 5\ 7\ 2\ 8]^T=10+10+28+12+64=124$$

which is close to the actual label of 114.

Second training example:

| $X_0$ | $X_1$ | $X_2$ | $X_3$ | $X_4$ |
| --- | --- | --- | --- | --- |
| 1 | 4 | 6 | 8 | 10 |

Predicted:

$$\hat{Y}=[1\ 4\ 6\ 8\ 10]\cdot[10\ 5\ 7\ 2\ 8]^T=10+20+42+16+80=168$$

which matches the actual label exactly.

One prediction is spot on and the other lands close to its label - which is exactly what minimizing the MSE drives towards!
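
To run this check yourself, plug the weight vector from the table above back in and compare the two predictions against their labels (reusing `X_b` and `y`; `W_final` is just my name for those weights):

```python
W_final = np.array([10.0, 5.0, 7.0, 2.0, 8.0])

# Check the first two training examples from the tables above
for row, label in zip(X_b[:2], y[:2]):
    print(f"predicted {row @ W_final:.0f}  vs  actual {label:.0f}")
```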


Further Reading

I’ve implemented Linear Regression and Gradient Descent from scratch using NumPy in my repo:
from_scratch_implementations
Check out LR.py for the linear regression code and SGD.py for gradient descent!
