
COURSEWORK

Vikas's AI-ML-001 course work. Lv 1


Project report AIML

7 / 2 / 2023


FIRST DAY  
 

My name is Vikas, a student at MARVEL, and today, with all the buzz, I officially started learning at MARVEL UVCE.
I was given the domain of AIML, and yes, that is exactly what I had opted for! Here at MARVEL, we have levels to climb, each consisting of many tasks. Today I thought of starting with Level 1, but no sooner had I visited the website than I realized there are some introductory concepts to study before getting on to the levels. Hence I started with the first intro concept, i.e. Statistics in AIML.
 

STATISTICS  
Statistics is all about collecting data, analyzing it, and presenting it in an interpretable way. It is of two types: descriptive and inferential. Descriptive statistics is about summarizing data and presenting it using bar graphs, pie charts, etc.; measures such as the mean, median, mode, and standard deviation are used here. Inferential statistics is about taking sample data and drawing conclusions about the larger population it came from; the concept of probability is used here.
Why should you learn statistics for ML?
Say you have developed a product. Statistics tells you which features are important in your product, what minimal performance is required, and how you can design a strategy for it. It also tells you the most common and expected outcomes of your product. So statistics is a tool that keeps you on the safer side.
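As a quick illustration of the descriptive measures mentioned above, here is a minimal sketch on a small invented sample (the numbers are made up for demonstration):

```python
# Descriptive statistics on a small hypothetical sample.
import numpy as np
from statistics import mode

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = np.mean(data)        # average value
median = np.median(data)    # middle value of the sorted data
most_common = mode(data)    # most frequent value
std_dev = np.std(data)      # spread of the data around the mean

print(mean, median, most_common, std_dev)  # 5.0 4.5 4 2.0
```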
 
 
 
Task 1
We start by importing the basic modules: numpy, pandas, and matplotlib. The data is imported as a pandas DataFrame, which behaves like a 2D array in C, and the house prices are added to it as a column. DataFrames let us manipulate the data mathematically using pandas' built-in methods, which is how data cleaning is done: adding columns, slicing the data in different ways with pd.DataFrame(), and converting it to a NumPy 2D array where needed. This dataset helps us understand the relationship between variables.
Linear regression is a supervised learning algorithm, which means it works on a set of labeled examples. Before we look at how to use linear regression in Python, let's first go over the concepts we need to understand before using the appropriate method.
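The DataFrame workflow described above can be sketched as follows; the column names and values here are invented for illustration:

```python
# Hypothetical sketch: build a small house dataset, add the prices as a
# column, and convert it to a NumPy 2D array for mathematical manipulation.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "area_sqft": [1000, 1500, 2000, 2500],
    "bedrooms": [2, 3, 3, 4],
})

# Add the house prices to the DataFrame as a new column.
df["price"] = [150000, 200000, 260000, 320000]

# Convert to a NumPy 2D array; slicing and column arithmetic work on both.
arr = df.to_numpy()
print(arr.shape)  # (4, 3)
```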

https://github.com/vikasism/report/blob/main/Task1.md  
 
 
Task1(2)  
In this project we are going to learn logistic regression. Logistic regression models the probability of a discrete outcome given an input variable: it tells you whether something is true or false. The prediction of logistic regression is categorical; in the iris dataset, the different classes are present in iris.target, and the features on which they are divided are present in iris.feature_names. In this project, our goal is to optimize the performance of logistic regression by minimizing bias and variance. This can be done by finding the model with the best results on a subset of the training set, while also performing cross-validation.
Logistic regression is a supervised classification algorithm. It measures the relationship between one or more independent variables (X) and the categorical dependent variable (Y) by estimating probabilities using a logistic (sigmoid) function. Linear regression, by contrast, is not used for classification. The steps involved in implementing the model are as follows: https://github.com/vikasism/report/blob/main/TASK1(2).md
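A minimal sketch of the approach described above, assuming scikit-learn is available (the split ratio and `max_iter` value are choices made for this example):

```python
# Logistic regression on the iris dataset with a train/test split
# and cross-validation on the training portion.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

iris = load_iris()
X, y = iris.data, iris.target  # features and class labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Cross-validation gives a less optimistic estimate of generalization.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy, cv_scores.mean())
```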

 
 

TASK2
Matplotlib and Visualizing Data. Matplotlib is used for basic graph plotting: line charts, bar graphs, etc. It mainly works with datasets and arrays, and it works productively with data arrays and frames. It treats the axes and figures as objects. It is highly customizable and pairs well with pandas and NumPy for exploratory data analysis. The different types of plots which can be used to visualize the dataset are as follows: https://github.com/vikasism/report/blob/main/TASK2.md
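A minimal sketch of a few of the plot types mentioned above, on invented data (the `Agg` backend is used here only so the script runs without a display):

```python
# Line, bar, and scatter plots side by side, treating axes and the
# figure as objects.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 6)
y = np.array([2, 3, 5, 7, 11])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x, y)      # line chart
axes[0].set_title("Line")
axes[1].bar(x, y)       # bar graph
axes[1].set_title("Bar")
axes[2].scatter(x, y)   # scatter plot
axes[2].set_title("Scatter")

fig.savefig("plots.png")
```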

 
 
TASK3  

Linear regression is a commonly used statistical method that allows us to model the relationship between a dependent variable and one or multiple independent variables. In this report, we will discuss how to implement linear regression from scratch using example images.
 

Step 1: Preparation of data We will use example images in the form of arrays, where each image is represented as a two-dimensional array with rows and columns representing pixels. The dependent variable will be the label associated with each image, and the independent variables will be the pixel values. We will divide the data into a training set and a testing set.

 
Step 2: Initialization We need to initialize the weights and biases of our linear regression model. These values will be updated during the training process to minimize the loss function.
 

Step 3: Training The training process involves iterating over the training data and updating the weights and biases to minimize the loss function. We will use mean squared error as the loss function, which measures the difference between the predicted output and the actual label. The gradient descent algorithm can be used to update the weights and biases in the direction of the minimum loss.

 
Step 4: Testing Once the training process is completed, we can evaluate the performance of the model on the testing data. We will calculate the mean squared error on the test data and use it as a measure of how well the model has learned the relationship between the pixel values and the labels.

 
Step 5: Deployment Finally, we can deploy the model in a real-world scenario by using it to predict the label of new images based on their pixel values.
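Steps 1 through 4 above can be sketched compactly on a tiny synthetic dataset; the data, learning rate, and iteration count here are invented for illustration, with each "image" flattened to a short feature vector:

```python
# Linear regression from scratch: prepare data, initialize, train with
# gradient descent on MSE, and test on held-out data.
import numpy as np

rng = np.random.default_rng(0)

# Step 1: features (flattened pixel values) and numeric labels.
X = rng.random((100, 4))                    # 100 samples, 4 "pixels" each
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + 0.7                        # labels from a known linear rule
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Step 2: initialize the weights and bias.
w = np.zeros(4)
b = 0.0

# Step 3: gradient descent updates that minimize mean squared error.
lr = 0.1
for _ in range(2000):
    err = X_train @ w + b - y_train
    w -= lr * (2 / len(y_train)) * (X_train.T @ err)
    b -= lr * (2 / len(y_train)) * err.sum()

# Step 4: evaluate with MSE on the test set.
test_mse = np.mean((X_test @ w + b - y_test) ** 2)
print(test_mse)
```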

 

Regression Metrics: In regression problems, we need to evaluate the performance of the model by comparing the predicted values to the actual values. The following regression metrics will be used in this project:

 
Mean Absolute Error (MAE) - measures the average magnitude of the errors in a set of predictions.

 
Mean Squared Error (MSE) - measures the average of the squares of the errors in a set of predictions.

 
Root Mean Squared Error (RMSE) - the square root of the MSE.
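The three metrics above are straightforward to compute by hand; this sketch uses invented predictions:

```python
# MAE, MSE, and RMSE on hypothetical predictions.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))   # average magnitude of the errors
mse = np.mean(errors ** 2)      # average of the squared errors
rmse = np.sqrt(mse)             # square root of MSE, in the units of y

print(mae, mse, rmse)  # 0.75 0.875 0.935...
```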

 
Evaluation: The performance of each algorithm will be evaluated based on the accuracy of the predictions on the test data. The accuracy is defined as the proportion of correct predictions in the test data.
https://github.com/vikasism/report/blob/main/TASK3.md

 
 

Task 4  
Linear and Logistic Regression - Coding the model from SCRATCH Linear Regression Implementation of linear regression model on a particular dataset involves various functions. The key concepts involved in deriving the linear regression model are as follows
 

The Cost Function: Cost function measures how a machine learning model performs. It is the calculation of the error between predicted values and actual values, represented as a single real number. The difference between the cost function and loss function is as follows: 

 

The cost function is the average error over the n samples in the data (the whole training set), while the loss function is the error for an individual data point (one training example). The cost function of linear regression is the mean squared error, or its square root, the root mean squared error; we square the errors so that positive and negative errors don't cancel each other out.

Given our simple linear equation y = mx + b, we can calculate the MSE as:

MSE = (1/N) Σᵢ (yᵢ − (m·xᵢ + b))²

Gradient Descent: To minimize MSE we use Gradient Descent to calculate the gradient of our cost function. Gradient descent consists of looking at the error that our weight currently gives us, using the derivative of the cost function to find the gradient (The slope of the cost function using our current weight), and then changing our weight to move in the direction opposite of the gradient. We need to move in the opposite direction of the gradient since the gradient points up the slope instead of down it, so we move in the opposite direction to try to decrease our error.
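The gradient step described above can be sketched for y = m·x + b; the data, learning rate, and iteration count are invented for this example:

```python
# Gradient descent on MSE for a simple linear model y = m*x + b.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # generated by y = 2x + 1

m, b, lr = 0.0, 0.0, 0.05
for _ in range(5000):
    pred = m * x + b
    # Partial derivatives of MSE = mean((pred - y)^2):
    dm = np.mean(2 * (pred - y) * x)  # slope of the cost w.r.t. m
    db = np.mean(2 * (pred - y))      # slope of the cost w.r.t. b
    # Move opposite to the gradient to decrease the error.
    m -= lr * dm
    b -= lr * db

print(m, b)  # should approach m = 2, b = 1
```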

LOGISTIC REGRESSION

Unlike linear regression, which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value, which can then be mapped to two or more discrete classes. Implementation of a logistic regression model on a particular dataset involves various functions.

Sigmoid Function: In order to map predicted values to probabilities, we use the sigmoid function. The function maps any real value into another value between 0 and 1. In machine learning, we use sigmoid to map predictions to probabilities.
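The sigmoid function described above, σ(z) = 1 / (1 + e⁻ᶻ), is a one-liner:

```python
# Sigmoid: maps any real value into the open interval (0, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5 — the decision boundary
print(sigmoid(4.0))   # close to 1
print(sigmoid(-4.0))  # close to 0
```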


https://github.com/vikasism/report/blob/main/TASK4.md
 
 
 
 
 

TASK5

The k-Nearest Neighbor (k-NN) algorithm is a popular and simple machine learning algorithm used for classification and regression problems. It is a non-parametric algorithm that makes predictions based on the majority of the k-nearest data points to a given sample.

 
 
Working of k-NN:
The k-NN algorithm operates on the principle of instance-based learning. It uses a distance metric, such as Euclidean distance, to calculate the similarity between a given sample and the data points in the training set. The k-nearest data points to the sample are selected, and their class labels are used to make a prediction. In classification problems, the majority vote of the k-nearest data points is used to predict the class label of a sample. In regression problems, the average of the k-nearest data points is used to make a prediction.

 

Selection of k:
 
The choice of k is a crucial factor in the k-NN algorithm as it affects the performance of the algorithm. A small value of k results in a high variance model, which is sensitive to noise in the data. A large value of k results in a high bias model, which may miss important patterns in the data.
 
Example:

Let's consider a binary classification problem, where we need to classify a given data point as either class 0 or class 1. The k-NN algorithm is applied to a sample data set with two features, x1 and x2, as follows:
 
The sample data set is split into a training set and a test set.
The k-NN algorithm is trained on the training set and k is set to 3.
The k-NN algorithm is used to make predictions on the test set by finding the 3 nearest data points in the training set and taking the majority vote of their class labels.
The accuracy of the predictions is evaluated and compared to other algorithms.
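The example steps above can be sketched with scikit-learn on an invented two-feature binary dataset (the cluster centers and split ratio are choices made for this illustration):

```python
# k-NN with k = 3 on a synthetic two-feature binary classification problem.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two clusters: class 0 around (0, 0), class 1 around (3, 3).
X0 = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
X1 = rng.normal(loc=3.0, scale=0.5, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# Split the sample data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# k = 3: predictions take the majority vote of the 3 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
print(accuracy)
```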
 
https://github.com/vikasism/report/blob/main/TASK5.md

UVCE,
K. R Circle,
Bengaluru 01