1 / 11 / 2024
I used a dataset on California housing prices, since the Boston housing dataset was removed after being deemed problematic over concerns about bias in its creation. For guidance, I explored some fantastic tutorials on MARVEL's website and followed along with a YouTube video by 'NeuralNine'. I quite enjoyed coding it and understanding how it works.
Here's the Code:
Linear Regression Code Here
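The original snippet isn't shown above, so here's a minimal sketch of the idea, assuming scikit-learn's fetch_california_housing (the split size and random seed are illustrative choices, not necessarily what I used):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Load the California housing dataset (the replacement for the removed Boston dataset)
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Fit an ordinary least-squares linear regression
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on held-out data
print("R^2 on test set:", r2_score(y_test, model.predict(X_test)))
```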
To get a good understanding of what this actually does, I searched YouTube and worked through the following.
I used the Iris dataset - which I learnt was introduced by Ronald Fisher in 1936, hehe - to train a logistic regression model. I learnt that logistic regression is a classification model that learns to sort data into categories, such as "yes/no" or "0/1". It does this by finding patterns in the data and estimating the probability that a new example belongs to each class. After training, its predictions are compared to actual outcomes to assess how accurately the model performs.
Here's the Code:
Logistic Regression Code Here
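Roughly, training and checking the model could look like this minimal sketch with scikit-learn (the exact details are assumed, not necessarily what I originally wrote):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load Fisher's Iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train a logistic regression classifier
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)

# Compare predictions against the actual outcomes
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```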
I started exploring data visualization with Matplotlib using Google Colab to understand its features better. For the final report, though, I’ll switch to Kaggle to present my findings in a clearer way. Here’s the code I used:
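Something along these lines - a minimal sketch of a basic Matplotlib plot (the actual plots I made may well have differed):

```python
import numpy as np
import matplotlib.pyplot as plt

# A simple line + scatter plot to try out Matplotlib's basic features
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label="sin(x)")
plt.scatter(x[::10], np.sin(x[::10]), color="red", label="sampled points")
plt.title("Exploring Matplotlib")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```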
Through this exercise, I learned to manipulate arrays in NumPy by implementing two key operations: using tile() to create repeated patterns from a small array, and converting a one-dimensional sequence of numbers (60-89) into a structured 6x5 matrix using reshape(). The code for this is here:
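A minimal sketch of those two operations (the particular tile() pattern here is just an illustrative choice):

```python
import numpy as np

# Repeat a small array to create a 3x6 repeating pattern
pattern = np.tile(np.array([0, 1]), (3, 3))
print(pattern)

# Reshape the 1-D sequence 60..89 (30 numbers) into a 6x5 matrix
matrix = np.arange(60, 90).reshape(6, 5)
print(matrix)
```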
So... during this task, I learned that metrics are basically ways to measure how well our model is performing. For classification problems, I learned about accuracy (how many predictions were correct overall), precision (out of all the times we predicted "yes," how many were actually "yes"), recall (out of all the actual "yes" cases, how many did we catch), and F1-score (a balanced measure between precision and recall). These help evaluate the overall performance of the model.
Classification metrics are tools that measure how well our model predicts categories - basically showing us how many times it got the right answer (accuracy), how reliable its predictions are (precision), and how many correct answers it actually caught (recall).
Code:
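A minimal sketch of these metrics in scikit-learn, on made-up binary predictions (the numbers are hypothetical, just to show the calls):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy binary labels and predictions (hypothetical values)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))   # fraction predicted correctly overall
print("Precision:", precision_score(y_true, y_pred))  # of predicted "yes", how many were right
print("Recall:   ", recall_score(y_true, y_pred))     # of actual "yes", how many we caught
print("F1-score: ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```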
Regression metrics help measure prediction accuracy: MAE and RMSE show average error size, while R-squared indicates overall model fit.
Code:
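Again a minimal sketch with made-up numbers, assuming scikit-learn's metric functions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy regression targets and predictions (hypothetical values)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 6.5])

print("MAE: ", mean_absolute_error(y_true, y_pred))          # average absolute error
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))  # penalizes larger errors more
print("R^2: ", r2_score(y_true, y_pred))                     # overall goodness of fit
```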
Linear Regression: Used to model the relationship between a continuous dependent variable and one or more independent variables by fitting a linear equation. It’s widely used for predicting numeric values based on input features.
Logistic Regression: Used for binary classification problems where the outcome is either 0 or 1. It estimates the probability that a given input point belongs to a certain class, using the sigmoid function.
The Prediction Plot for linear regression shows the predicted values from the custom OLS model, Gradient Descent, and Scikit-Learn, allowing for a visual comparison of how well each model fits the data. For logistic regression, a ROC Curve was plotted to compare the true positive rate against the false positive rate. The Residual Plot for linear regression displays the errors for the custom OLS model, providing insights into model accuracy and the distribution of residuals.
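The plots themselves aren't reproduced here; below is a rough sketch of the linear-regression comparison on synthetic data (the data, learning rate, and iteration count are illustrative assumptions, not my original settings):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic 1-D data (made up for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 2, size=x.shape)

# Custom OLS via the normal equation: w = (X^T X)^-1 X^T y
X = np.column_stack([np.ones_like(x), x])  # add an intercept column
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the mean-squared-error loss
w_gd = np.zeros(2)
lr = 0.01
for _ in range(5000):
    grad = 2 * X.T @ (X @ w_gd - y) / len(y)
    w_gd -= lr * grad

# Scikit-learn's implementation
sk = LinearRegression().fit(x.reshape(-1, 1), y)

# Prediction plot: all three fits over the data
plt.scatter(x, y, s=15, label="data")
plt.plot(x, X @ w_ols, label="custom OLS")
plt.plot(x, X @ w_gd, "--", label="gradient descent")
plt.plot(x, sk.predict(x.reshape(-1, 1)), ":", label="scikit-learn")
plt.legend()
plt.show()

# Residual plot for the custom OLS model
plt.scatter(x, y - X @ w_ols, s=15)
plt.axhline(0, color="black")
plt.title("Residuals (custom OLS)")
plt.show()
```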
I learned that K-Nearest Neighbors (KNN) is an easy-to-understand method used to classify or predict values based on data. It works by finding the k nearest points to a new data point using a distance measure - usually the straight-line, or Euclidean, distance - and basing its prediction on those neighbours.
Code:
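A minimal sketch of KNN with scikit-learn on the Iris data (k = 5 is just an illustrative choice; Euclidean distance is the library default):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Use the Iris data again to try KNN
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Classify each test point by majority vote among its 5 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=5)  # Euclidean distance by default
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```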
While watching YouTube videos about KNN, I came across the channel 'IBM', where they gave an interesting example to illustrate how KNN works. I tried to do the same with Plotly.
Code:
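I can't reproduce IBM's exact example here, but something similar in spirit - two clusters and a new point to classify - might look like this in Plotly (the data is made up):

```python
import numpy as np
import plotly.graph_objects as go

# Two hypothetical clusters of points and a new query point (made-up data)
rng = np.random.default_rng(3)
a = rng.normal([2, 2], 0.6, size=(20, 2))
b = rng.normal([5, 5], 0.6, size=(20, 2))
new_point = np.array([3.5, 3.5])

fig = go.Figure()
fig.add_scatter(x=a[:, 0], y=a[:, 1], mode="markers", name="class A")
fig.add_scatter(x=b[:, 0], y=b[:, 1], mode="markers", name="class B")
fig.add_scatter(x=[new_point[0]], y=[new_point[1]], mode="markers",
                marker=dict(size=14, symbol="x"), name="new point")
fig.show()
```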
The Fourier Transform (FT) is a mathematical operation that transforms a time-domain signal into its frequency-domain representation. It's used to decompose a signal into its constituent frequencies, allowing us to analyze its frequency components and their amplitudes. (I'd done this in MATLAB, initially.)
Code:
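A minimal sketch using NumPy's FFT on a made-up two-tone signal (the frequencies and sampling rate are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

# A signal made of two sine waves: 5 Hz and 20 Hz
fs = 200                        # sampling rate in Hz
t = np.arange(0, 1, 1 / fs)     # one second of samples
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

# FFT: transform from the time domain to the frequency domain
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), 1 / fs)

# Amplitude of each frequency component (peaks should appear at 5 Hz and 20 Hz)
plt.plot(freqs, np.abs(spectrum) / len(signal) * 2)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Amplitude")
plt.show()
```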
I learned how to use the regression tool in Desmos to fit a straight line to data points with the equation y = mx + b.
Way to Desmos
In exploring data visualization, I also used Plotly, which provides an interactive platform for creating dynamic visuals, as opposed to the static plots from Matplotlib. In Plotly I explored features like hover information and zoom capabilities, which I found very cool, catchy and fun.
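For instance, a quick interactive scatter with hover details can be made like this (using Plotly Express's built-in Iris sample data, which may not be what I actually plotted):

```python
import plotly.express as px

# Interactive scatter: hover over points for details, drag to zoom
df = px.data.iris()  # built-in sample dataset
fig = px.scatter(df, x="sepal_width", y="sepal_length",
                 color="species", hover_data=["petal_length", "petal_width"])
fig.show()
```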
Decision Trees are a simple yet powerful tool used in machine learning for both classification and regression tasks. They work by asking a series of questions (based on features of the data) that lead to a prediction.
A Decision Tree splits the data into subsets based on the best feature at each step. It continues this process until no further improvement can be made. The result is a tree structure where internal nodes ask questions about features, branches represent the possible answers, and leaf nodes hold the final predictions.
Decision Trees are used in areas like finance (credit scoring), healthcare (diagnosis), and marketing (customer segmentation).
In short, Decision Trees break down complex problems into simpler, understandable decisions, but they require careful management to avoid overfitting.
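To make that concrete, here's a minimal sketch of a small Decision Tree with scikit-learn (max_depth=3 is an illustrative choice to keep the tree simple and limit overfitting):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

# Limit the depth so the tree stays interpretable and doesn't overfit
tree = DecisionTreeClassifier(max_depth=3, random_state=2)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))

# Visualize the questions the tree asks at each split
plot_tree(tree, filled=True)
plt.show()
```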