Shashank's AI-ML-001 course work. Lv 1

Shashank T S

Level-1 Report

22 / 1 / 2023


# Task 1-report

## Linear Regression

Linear Regression is a supervised machine learning algorithm used to model the relationship between variables and make predictions. It is represented by an equation of the form Y = a + bX, where Y is the dependent variable, X is the explanatory variable, b is the slope of the line, and a is the intercept. Linear Regression is considered a fundamental algorithm in machine learning, and this implementation relies on the basic modules numpy, pandas, and matplotlib. The Boston dataset from sklearn.datasets is loaded into a variable and converted into a data frame using Pandas for mathematical manipulation and cleaning. The house prices (the target values) are then added to the data frame as a column.

## Logistic Regression

Logistic Regression is a statistical method used to model the probability of a categorical outcome, such as true or false, given one or more input variables. It is used to make predictions about discrete outcomes. To start with logistic regression, we load the required modules and the iris dataset from sklearn.datasets. The classes in the iris dataset are stored in the variable 'iris.target', and the features on which the data is divided are listed in the variable 'iris.feature_names'. The data is then divided into training and testing sets using the train_test_split function, where the parameter 'test_size' specifies the fraction of the data held out for testing the model. A logistic regression instance is created and the model is fit using the training data. The results are evaluated using a confusion matrix; the accuracy was found to be 97%, with only one incorrect prediction: a versicolor sample classified as virginica.

# Task 2-report

## Data Visualisation

Matplotlib is a plotting library in Python used for creating basic graphs such as line charts, bar graphs, scatter plots, bubble plots, and histograms.
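
As a small illustration of two of the basic plot types just listed, the sketch below draws a line chart and a histogram side by side. The data is synthetic and the variable names are invented for this example; the Agg backend and the in-memory buffer are assumptions made only so the script runs without a display (pass a file name to savefig to write an image to disk instead).

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, invented for illustration only
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0
samples = rng.normal(loc=0.0, scale=1.0, size=500)

fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Line chart
ax_line.plot(x, y, label="y = 2x + 1")
ax_line.set_xlabel("x")
ax_line.set_ylabel("y")
ax_line.set_title("Line chart")
ax_line.legend()

# Histogram with 20 equal-width bins
counts, bin_edges, _ = ax_hist.hist(samples, bins=20)
ax_hist.set_xlabel("value")
ax_hist.set_ylabel("count")
ax_hist.set_title("Histogram")

fig.tight_layout()

# Render the figure as PNG into memory
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```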
It is highly customizable and pairs well with other libraries such as Pandas and Numpy for Exploratory Data Analysis. Matplotlib offers a wide range of plot types including line and area plots, bar plots, pie plots, box plots, violin plots, marginal plots, contour plots, heat maps, and 3D plots. Each plot type is suited to visualizing a specific aspect of the data, so the right choice depends on what the analysis needs to show.

# Task 3-report

## Numpy Library

Numpy is a powerful library for working with arrays in Python. One useful feature of Numpy is the ability to repeat a small array across each dimension, allowing for the creation of larger arrays with a specific pattern. This can be achieved using the numpy.tile() function, which takes the input array and the number of repetitions along each axis. The related numpy.repeat() function repeats individual elements instead; it takes the input array, the number of repetitions for each element, and the axis along which to repeat. For example, given a small array A = [1, 2, 3], numpy.tile(A, (3, 1)) repeats it along the first axis 3 times to generate a new array B = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]. Repeating each element 2 times along the second axis of A viewed as a column, numpy.repeat(A[:, None], 2, axis=1), generates a new array C = [[1, 1], [2, 2], [3, 3]]. Note that numpy.repeat(A, 3) applied directly to the one-dimensional A gives [1, 1, 1, 2, 2, 2, 3, 3, 3], not B.

Another useful feature of Numpy is the ability to generate an array of element indexes such that the array elements appear in ascending order. This can be achieved using the numpy.argsort() function, which returns the indices that would sort an array. For example, given an array A = [3, 2, 1], the function numpy.argsort(A) will return the array [2, 1, 0], the indices of the original elements taken in ascending order of value.

In summary, Numpy provides several useful functions for working with arrays, such as tile(), repeat(), and argsort(), that can be used to generate new arrays with specific patterns or to order the elements of an array. These functions simplify array handling in Python and save development time.
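
The sketch below reproduces the arrays B and C from the example above along with the argsort result. It assumes numpy.tile for B and numpy.repeat on a column view of A for C, because numpy.repeat applied directly to a one-dimensional array repeats individual elements rather than stacking the whole array. The names flat, D, order, and sorted_D are introduced here for illustration.

```python
import numpy as np

A = np.array([1, 2, 3])

# Tile the whole array three times along a new first axis
B = np.tile(A, (3, 1))
# B is [[1, 2, 3], [1, 2, 3], [1, 2, 3]]

# Repeat each element twice along the second axis of a column view of A
C = np.repeat(A[:, np.newaxis], 2, axis=1)
# C is [[1, 1], [2, 2], [3, 3]]

# Applied to the 1-D array itself, repeat() duplicates each element in sequence
flat = np.repeat(A, 2)
# flat is [1, 1, 2, 2, 3, 3]

# Indices that would sort an array in ascending order
D = np.array([3, 2, 1])
order = np.argsort(D)
# order is [2, 1, 0]

# Indexing with those indices yields the sorted array
sorted_D = D[order]
# sorted_D is [1, 2, 3]
```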

# Task 4-report

## Metrics and Performance Evaluation

The top_k_accuracy_score function is a generalization of accuracy_score: a prediction is considered correct as long as the true label is associated with one of the k highest predicted scores. The accuracy_score function computes either the fraction (default) or the count (normalize=False) of correct predictions; the fraction can also be computed by hand as acc = np.sum(predictions == y_test) / len(y_test).

The classification_report function builds a text report showing the main classification metrics. Precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples. The F-measure is a weighted harmonic mean of precision and recall; an Fβ measure reaches its best value at 1 and its worst at 0. The hamming_loss function computes the average Hamming loss or Hamming distance between two sets of samples.

For regression, the max_error function computes the maximum residual error, a metric that captures the worst-case error between the predicted value and the true value. The mean_absolute_error function computes the mean absolute error, a risk metric corresponding to the expected value of the absolute error loss or l1-norm loss. The R² score represents the proportion of variance (of y) that has been explained by the independent variables in the model. It provides an indication of goodness of fit and therefore a measure of how well unseen samples are likely to be predicted by the model.

# Task 5-report

## Linear and Logistic Regression - Coding the model from SCRATCH

Implementing a linear regression model on a specific dataset involves several steps. The key concepts involved in creating the linear regression model include the cost function, gradient descent, and the various functions and libraries used in the implementation.
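
As a rough illustration of how a mean squared error cost and gradient descent fit together, here is a minimal from-scratch sketch. It is not the original coursework code: the class layout, the default learning rate and iteration count, and the synthetic data are all assumptions chosen for this example, and the train/test split is omitted to keep the block dependent on numpy alone.

```python
import numpy as np


class LinearRegression:
    """Hypothetical from-scratch model trained with batch gradient descent."""

    def __init__(self, learning_rate=0.05, n_iters=2000):
        self.learning_rate = learning_rate
        self.n_iters = n_iters
        self.weights = None
        self.bias = 0.0

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        for _ in range(self.n_iters):
            # Residuals of the current predictions
            error = X @ self.weights + self.bias - y
            # Gradients of the mean squared error w.r.t. weights and bias
            grad_w = (2.0 / n_samples) * (X.T @ error)
            grad_b = (2.0 / n_samples) * np.sum(error)
            # Step against the gradient, scaled by the learning rate
            self.weights -= self.learning_rate * grad_w
            self.bias -= self.learning_rate * grad_b
        return self

    def predict(self, X):
        return X @ self.weights + self.bias


def mean_squared_error(y_true, y_pred):
    """Cost function: average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)


# Synthetic data (assumed for illustration): y = 3x + 4 plus small noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 2, size=(200, 1))
y = 3.0 * X[:, 0] + 4.0 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
mse = mean_squared_error(y, model.predict(X))
```

With these settings the learned slope and intercept should land close to the 3 and 4 used to generate the data; a learning rate that is too large for the feature scale would make the updates diverge instead.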
The cost function measures the performance of a machine learning model as a single real number, computed from the differences between predicted and actual values. The cost function for linear regression is typically the mean squared error or the root mean squared error. To minimize this cost, gradient descent computes the gradient of the cost function with respect to the weights and bias and updates them in the opposite direction, scaled by the learning rate. The steps involved in implementing the linear regression model include:

1. Creating a LinearRegression class and setting the learning rate and number of iterations
2. Defining the fit method and implementing the gradient descent function
3. Defining the predict method to take in new test samples
4. Importing necessary libraries
5. Splitting the data into training and testing sets using the train_test_split function
6. Loading the data into a dataframe, fitting the model, and predicting the test values
7. Defining the cost function as mean squared error and using it to evaluate the model
8. Visualizing the data with a scatter plot

The implementation of logistic regression, which is used for predicting binary outcomes, involves similar steps and concepts, but with adjustments to account for the logistic sigmoid function and the categorical nature of the output.

# Task 6-report

## K-Nearest Neighbor Algorithm

The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised learning method used for both regression and classification problems. The algorithm makes a prediction for an individual data point based on its proximity to other, similar points. Regression problems use the average of the k nearest neighbors to make a prediction, while classification uses a vote over their discrete class labels. To make a classification, the distance between the query point and the other data points must first be calculated using a distance metric such as Euclidean distance.
For classification, the majority class label among the k nearest neighbors from the training dataset is assigned as the predicted class for the new data point. In regression, the mean or median of the continuous values of the k nearest neighbors from the training dataset is used as the predicted value for the new data point. KNN can be implemented using the scikit-learn library's neighbors.KNeighborsClassifier on multiple suitable datasets. The steps involved include importing necessary libraries, loading the dataset, splitting the data into training and testing sets, fitting the classifier, and calculating the accuracy of the model using the confusion matrix. KNN can also be implemented from scratch, which involves defining the euclidean_distance function and a KNN class with fit, predict, and _predict methods, and then comparing the results with the built-in scikit-learn method for different datasets.
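
The from-scratch structure described above (a euclidean_distance function and a KNN class with fit, predict, and _predict methods) might look roughly like the following. This is a minimal sketch rather than the original coursework code: the default k, the tiny two-cluster dataset, and the use of collections.Counter for the majority vote are assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))


class KNN:
    """Hypothetical from-scratch k-nearest neighbors classifier."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # KNN has no training phase; it only stores the data
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X):
        return np.array([self._predict(x) for x in np.asarray(X)])

    def _predict(self, x):
        # Distance from the query point to every training point
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Indices of the k closest training points
        nearest = np.argsort(distances)[: self.k]
        # Majority vote over the neighbors' labels
        labels = self.y_train[nearest].tolist()
        return Counter(labels).most_common(1)[0][0]


# Toy data (invented): two well-separated clusters labelled 0 and 1
X_train = np.array(
    [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
)
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[1.1, 1.0], [5.1, 5.0]])

model = KNN(k=3).fit(X_train, y_train)
predictions = model.predict(X_test)
# predictions is [0, 1]
```

To compare against the built-in method, the same arrays can be passed to scikit-learn's neighbors.KNeighborsClassifier with n_neighbors=3, which should agree on data this cleanly separated.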

UVCE,
K. R Circle,
Bengaluru 01