20 / 7 / 2024
I used Kaggle to write and execute the code for both the linear and logistic regression models, following the tutorials listed on the MARVEL website.
1. Linear regression
I used a dataset detailing housing prices in California (the usual Boston housing dataset was removed from scikit-learn, since investigation found it was built on racist assumptions and the dataset was deemed tainted).
The code is as follows:
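In outline, it was something like the minimal sketch below. I'm assuming scikit-learn's fetch_california_housing loader and LinearRegression here, since that's the standard tutorial approach; the split fraction and seed are just placeholder choices.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load the California housing data
data = fetch_california_housing()
X, y = data.data, data.target
# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit an ordinary least-squares model and score it on the held-out set
model = LinearRegression()
model.fit(X_train, y_train)
print("R^2 on test set:", model.score(X_test, y_test))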
2. Logistic Regression
The dataset used was the Iris dataset, the same one used in the instructional sites and videos.
The code is as follows:
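A minimal sketch, assuming the usual scikit-learn pipeline (load_iris plus LogisticRegression, with max_iter raised so the solver converges):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load the Iris data
X, y = load_iris(return_X_y=True)
# Hold out 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the classifier and check accuracy on the held-out set
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))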
Various Python libraries may be used for data visualisation, but the two most common are Matplotlib and Seaborn. I'm sticking primarily to matplotlib.pyplot, just for familiarity's sake.
I read through a chunk of the NumPy documentation, got tired of combing through it in detail, so I turned to an internet search instead and wrote the following code.
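It was something along these lines (the arrays and the plot below are just made-up stand-ins for whatever I actually ran):
import numpy as np
import matplotlib.pyplot as plt
# Build some sample data with NumPy
x = np.linspace(0, 10, 100)
y = np.sin(x) + np.random.normal(0, 0.1, size=x.shape)
# Plot the noisy points against the clean curve
plt.scatter(x, y, s=10, label="noisy samples")
plt.plot(x, np.sin(x), color="red", label="sin(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()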
Evaluation metrics are mathematical measures used to find out how accurate a certain model's results are. These can be computed using scikit-learn, which contains all the basics we need.
1. Regression Metrics
For regression metrics, we calculate mean squared error, root mean squared error, mean absolute error, the R² score, and more. I didn't do more, though. I only did these four, on the California Housing Prices task completed above.
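A sketch of how these can be computed with scikit-learn, assuming model, X_test and y_test carry over from the linear regression code above:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Predictions from the California housing model above
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("R^2: ", r2_score(y_test, y_pred))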
2. Classification Metrics
Classification metrics evaluate models that sort data into discrete classes. We calculate accuracy, precision, recall, F1 score and ROC-AUC score.
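Again with scikit-learn, assuming clf, X_test and y_test carry over from the Iris logistic regression code above. Iris has three classes, so precision, recall and F1 need an averaging mode, and ROC-AUC needs one-vs-rest class probabilities:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# Predictions from the Iris classifier above
y_pred = clf.predict(X_test)
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall:   ", recall_score(y_test, y_pred, average="macro"))
print("F1 score: ", f1_score(y_test, y_pred, average="macro"))
# ROC-AUC works on class probabilities, one-vs-rest for multiclass
print("ROC-AUC:  ", roc_auc_score(y_test, clf.predict_proba(X_test), multi_class="ovr"))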
1. Linear Regression Model
This model is built using only NumPy. Our main objective is to minimise the errors in the model's predictions by continuously adjusting its parameters until our cost (the variation between the actual values and the fitted line) is low, meaning the difference between actual and predicted values is low. Things to keep in mind:
First, we'll take our loss function as MSE.
Second, we'll use the gradient descent algorithm (described above).
We're using y=mx+b!
Then, we can implement the model and see how it fares using performance metrics.
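A minimal sketch of that NumPy-only model, trained on made-up toy data (the learning rate and epoch count are arbitrary picks):
import numpy as np

def train_linear_regression(X, y, lr=0.01, epochs=1000):
    # Fit y = m*x + b by gradient descent on the MSE loss
    m, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        y_pred = m * X + b
        # Partial derivatives of MSE with respect to m and b
        dm = (-2 / n) * np.sum(X * (y - y_pred))
        db = (-2 / n) * np.sum(y - y_pred)
        m -= lr * dm
        b -= lr * db
    return m, b

# Toy data: roughly y = 3x + 2 with some noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 5, 100)
y = 3 * X + 2 + rng.normal(0, 0.5, 100)
m, b = train_linear_regression(X, y)
print(f"m = {m:.3f}, b = {b:.3f}")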
2. Logistic Regression Model
The logistic regression model is basically the linear one given a confidence boost: the linear output gets squashed through a sigmoid, so the model can take inputs over a bigger range and gives confident predictions (near 0 or 1) at the extremes of the sigmoid curve. Things to keep in mind:
First, our loss function is cross entropy.
Second, we'll use the gradient descent algorithm again.
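And a matching sketch for the logistic model, again NumPy-only, on made-up binary toy data:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    # Fit a binary classifier by gradient descent on cross-entropy loss
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        y_pred = sigmoid(X @ w + b)
        # Gradient of cross-entropy w.r.t. weights and bias
        dw = (1 / n) * (X.T @ (y_pred - y))
        db = (1 / n) * np.sum(y_pred - y)
        w -= lr * dw
        b -= lr * db
    return w, b

# Toy data: class 1 when x1 + x2 > 2, class 0 otherwise
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, (200, 2))
y = (X[:, 0] + X[:, 1] > 2).astype(float)
w, b = train_logistic_regression(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(float)
print("Training accuracy:", (preds == y).mean())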
This is another classification model which essentially works on the principle of sahavaasa-dosham (roughly, "you're judged by the company you keep"). It looks at a number k of its neighbouring points (where k is an integer) and predicts the class of the given point based on the majority class among those neighbours.
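A quick scikit-learn sketch on the Iris dataset (k = 5 here is just a common default, not anything tuned):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Each test point gets the majority class of its 5 nearest training neighbours
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))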
A decision tree is another type of predictive algorithm which comes under supervised learning, much like kNN, linear and logistic regression, all of which I've covered in previous tasks. Boiled down, this method can be described as a decision-making process based on 'maps' made from one choice to the next.
For example, let's take the way we ourselves make decisions, with something as simple as lunch for our purposes right now. The first question would be, "Is it lunchtime/am I hungry?". If the answer is yes, we'll go down one line of thought, maybe something like, "What do I want to eat?". If the answer isn't a yes, we'll think about something else: when to have lunch, perhaps, or a different task at hand to be done. In either case, each question we pose to ourselves has a certain set of answers, each of which goes a certain route, or branches a certain way. These may overlap with each other, or they may be independent. They may also circle back.
The processing that our minds just did, even with a task as simple as deciding lunch, is a type of decision tree. It can also be thought of as multiple switch statements, if we think about it!
Now in more official, ML-y terms, we can say that a decision tree works by breaking down data into smaller and smaller subsets, until it has enough to make a prediction, just like we asked ourselves more detailed questions about lunch, until we could finally decide on what to eat.
A decision tree consists of a root node (Am I hungry? The primary data which will be split), branches (the pathways to the next split), decision nodes (the smaller subsets, or more detailed questions: What will I eat? Cheese or a grape?) and finally, leaf nodes (the final predictions: I will eat cheese despite being lactose intolerant).
The process is basically what I described before: the tree first selects the one feature that best splits the data. Then, with questions that help it pick the optimal answer, it splits this data into smaller subsets. This repeats until it reaches its final predictions at the leaf nodes.
There are two types, as mentioned in the intro: classification trees, which predict discrete classes (like the Iris species below), and regression trees, which predict continuous values (like housing prices).
It is possible to build a decision tree in Python from scratch if we're aware of the parameters used for splitting, but since I don't have the time to look that up at the moment, we'll skip the from-scratch code and I'll plug in a scikit-learn version from ChatGPT. The basic steps are:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
data = load_iris()
X = data.data
y = data.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
# Make predictions and evaluate
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Plotly is a much more dynamic/interactive tool that we can use to visualise data. It's similar to Matplotlib, but on co— an energy drink. In any case, I used mostly the same datasets as in task 2 and messed around with some functions.
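One of the things I tried looked roughly like this (Plotly Express ships its own copy of the Iris data, so I'm using that here):
import plotly.express as px
# Plotly Express bundles a copy of the Iris dataset as a DataFrame
df = px.data.iris()
# Interactive scatter plot: hover, zoom and pan all work out of the box
fig = px.scatter(df, x="sepal_width", y="sepal_length",
                 color="species", title="Iris sepal measurements")
fig.show()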