

Vrushank's AI-ML-001 Coursework, Level 3

Vrushank R Rao (0344)

Vrushank's AI-ML Level 2 report

14 / 9 / 2023


Task 1: Linear and Logistic Regression - HelloWorld for AI-ML


  1. Linear Regression - Predict the price of a home based on multiple variables.
  2. Logistic Regression - Train a model to distinguish between different species of the Iris flower based on sepal length, sepal width, petal length, and petal width.

Linear Regression:

  1. The first step is to import the required libraries:
    • pandas, numpy, matplotlib and seaborn.
  2. Then, we import the dataset, in this case:
    df = pd.read_csv('USA_Housing-Copy1.csv')
    
  3. We separate the X and Y as independent and dependent variables respectively.
  4. From sklearn, we import the following:
    • train_test_split
    • LinearRegression
    • metrics
  5. We split our data using train_test_split: 60% for training and 40% for testing.
  6. We appropriately fit the training variables as: lm.fit(X_train, y_train) where lm is an instance of Linear Regression
  7. We make the prediction and then compare our prediction to the actual value using:
    pDf = pd.DataFrame(predictions, y_test, columns=['Our Prediction'])
    
  8. We evaluate our predictions using:
    metrics.mean_absolute_error(y_test, predictions)
    metrics.mean_squared_error(y_test, predictions)
    np.sqrt(metrics.mean_squared_error(y_test, predictions))
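
Putting the steps above together, a minimal sketch of the whole pipeline could look like this. Treating 'Price' as the target and dropping the non-numeric 'Address' column is an assumption based on the standard USA_Housing dataset, and random_state is illustrative:

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn import metrics

    df = pd.read_csv('USA_Housing-Copy1.csv')

    # Assumed columns: 'Price' is the dependent variable, 'Address' is non-numeric
    X = df.drop(['Price', 'Address'], axis=1)
    y = df['Price']

    # 60% training / 40% testing, as in step 5
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

    lm = LinearRegression()
    lm.fit(X_train, y_train)
    predictions = lm.predict(X_test)

    # Step 8: the three error metrics
    print('MAE :', metrics.mean_absolute_error(y_test, predictions))
    print('MSE :', metrics.mean_squared_error(y_test, predictions))
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))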

Logistic Regression:

  1. We import the dataset, in this case: iris_train = pd.read_csv('Iris.csv')

  2. We drop the unnecessary details using: iris_train.drop(['Id'], axis=1, inplace=True)

  3. We see that there are three different species:

    • virginica
    • versicolor
    • setosa
  4. Using get_dummies, we convert the species labels into binary indicator columns (0 or 1) that the regression can process, i.e.,

     species = pd.get_dummies(iris_train['Species'], drop_first=True)
    

    and drop_first=True also implicitly gets rid of the setosa column, since setosa is simply the case where both remaining indicators are 0.

  5. Similarly, we split the data: 70% for training and 30% for testing.

  6. We appropriately fit the training variables as Logmodel.fit(X_train, y_train), where Logmodel is an instance of LogisticRegression.

  7. We make the prediction and then compare our prediction to the actual value using: predictions = Logmodel.predict(X_test)

  8. We evaluate the accuracy using:

    print(classification_report(y_test, predictions))
    

We also build the confusion matrix; on this test split, the model is 100% accurate.
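
As a sketch, the full flow might look like this. Note that the report does not say which indicator column was used as the target, so predicting virginica vs. not-virginica here is an assumption (as is the exact dummy column name, which depends on the label strings in Iris.csv):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix

    iris_train = pd.read_csv('Iris.csv')
    iris_train.drop(['Id'], axis=1, inplace=True)

    # drop_first=True drops the setosa indicator, leaving two 0/1 columns
    species = pd.get_dummies(iris_train['Species'], drop_first=True)

    # Assumption: predict one indicator (virginica vs. the rest); taking the
    # last dummy column sidesteps the exact label-string naming
    X = iris_train.drop('Species', axis=1)
    y = species.iloc[:, -1]

    # 70% training / 30% testing, as in step 5
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

    Logmodel = LogisticRegression()
    Logmodel.fit(X_train, y_train)
    predictions = Logmodel.predict(X_test)

    print(confusion_matrix(y_test, predictions))
    print(classification_report(y_test, predictions))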

Task 1


Task 2: Matplotlib


  1. Explore the various basic characteristics of plots, as given below, using Python libraries
  2. Explore the different plot types (a minimal sketch follows this list)
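
A minimal sketch of the kind of exploration meant here; the data and styling choices are illustrative, not taken from the report:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 10, 100)
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # Basic characteristics: title, axis labels, colour, legend
    axes[0, 0].plot(x, np.sin(x), color='red', label='sin(x)')
    axes[0, 0].set_title('Line plot')
    axes[0, 0].set_xlabel('x')
    axes[0, 0].set_ylabel('y')
    axes[0, 0].legend()

    # Different plot types: scatter, bar and histogram
    axes[0, 1].scatter(x, np.cos(x), s=10)
    axes[0, 1].set_title('Scatter plot')
    axes[1, 0].bar(['a', 'b', 'c'], [3, 7, 5])
    axes[1, 0].set_title('Bar plot')
    axes[1, 1].hist(np.random.randn(1000), bins=30)
    axes[1, 1].set_title('Histogram')

    plt.tight_layout()
    plt.show()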

Link for task 2

Task 3: Metrics and Performance Evaluation

Understand the importance of regression and classification metrics and performance evaluation, and take a structured approach to applying the standard techniques and algorithms below while training a machine learning model.


  1. Dive into the Task 3 link.

  2. To import metrics, we use: from sklearn import metrics

  3. The different regression metrics are:

    • Mean Absolute Error: the mean of the absolute differences between the actual and predicted values.
    • Mean Squared Error: the mean of the squared differences between the actual and predicted values.
    • Root Mean Squared Error: the square root of the MSE; its unit is the same as that of the output variable.
  4. To import classification metrics:

     from sklearn.metrics import classification_report
    
  5. We can then print the classification report, which gives a series of metrics - precision, recall and f1-score:

    • Precision: a measure of how many of the positive predictions made are correct (true positives).
    • Recall: a measure of how many of the positive cases the classifier correctly predicted, out of all the positive cases in the data.
    • F1-Score: a measure combining both precision and recall, generally described as their harmonic mean.
  6. Confusion Matrix: To import confusion matrix, we use:

    from sklearn.metrics import confusion_matrix
    
  7. A look at the confusion matrix (a small illustrative example follows):
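
For a binary classifier, the confusion matrix is a 2x2 table counting true negatives, false positives, false negatives and true positives. A tiny illustrative example (the labels below are made up, not from the report):

    from sklearn.metrics import confusion_matrix, classification_report

    y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # actual classes
    y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # predicted classes

    # Rows are actual classes, columns are predicted classes:
    # [[TN, FP],
    #  [FN, TP]]
    print(confusion_matrix(y_true, y_pred))   # [[3 1], [1 3]] here
    print(classification_report(y_true, y_pred))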

Task 4:

Take a linear approach to understand the mathematical implications of regression algorithms, and execute the algorithms from scratch using the given datasets. Compare the accuracies of the model with built-in algorithms using the given datasets for:

  1. Linear Regression.
  2. Logistic Regression.

1. Linear Regression:

  1. Generate fake data for modeling
  2. Visualise this data using matplotlib
  3. Define the hypothesis function:
     def h(X, w):
         return w[1] * np.array(X) + w[0]
    
    This creates the hypothesis h(x) = w0 + w1*x.
  4. Define the cost function (the mean squared error) as shown.
  5. Perform gradient descent through def grad(w, X, y) and def descent(w_new, w_prev, lr) (see the sketch after this list).
  6. Then, we initialise the parameters and train our model.
  7. Then, visualise the result
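
A minimal from-scratch sketch along these lines; the synthetic data, learning rate and iteration count are assumptions, and the descent loop is written inline rather than as the descent(w_new, w_prev, lr) helper:

    import numpy as np
    import matplotlib.pyplot as plt

    # Step 1: generate fake data around a known line, y = 2x + 1 + noise
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, 100)
    y = 2 * X + 1 + rng.normal(0, 1, 100)

    def h(X, w):
        # Hypothesis: h(x) = w0 + w1*x
        return w[0] + w[1] * np.array(X)

    def cost(w, X, y):
        # Mean squared error cost J(w)
        return np.mean((h(X, w) - y) ** 2) / 2

    def grad(w, X, y):
        # Partial derivatives of J with respect to w0 and w1
        err = h(X, w) - y
        return np.array([np.mean(err), np.mean(err * X)])

    # Steps 5-6: gradient descent from initialised parameters
    w, lr = np.zeros(2), 0.01
    for _ in range(5000):
        w = w - lr * grad(w, X, y)
    print('learned w0, w1:', w, 'cost:', cost(w, X, y))

    # Step 7: visualise the result
    plt.scatter(X, y, s=10)
    plt.plot(np.sort(X), h(np.sort(X), w), color='red')
    plt.show()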

2. Logistic Regression:

  1. Here, the sigmoid function is defined, which gives the probability that a new input belongs to class 1.
  2. The cost function is binary cross-entropy.
  3. Similarly, calculate gradient descent, initialize the parameters and train the model.
  4. Visualize the result (a sketch follows this list).
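
A corresponding sketch for logistic regression; the synthetic one-dimensional data and the hyperparameters are assumptions:

    import numpy as np

    def sigmoid(z):
        # Squashes any real number into a probability in (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(w, X):
        # P(y = 1 | x) = sigmoid(w0 + w1*x)
        return sigmoid(w[0] + w[1] * X)

    def bce(w, X, y):
        # Binary cross-entropy cost; eps avoids log(0)
        p, eps = predict_proba(w, X), 1e-12
        return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    def grad(w, X, y):
        # Same form as the linear-regression gradient, with the
        # hypothesis replaced by the sigmoid output
        err = predict_proba(w, X) - y
        return np.array([np.mean(err), np.mean(err * X)])

    # Fake data: class 1 becomes more likely as x grows
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, 200)
    y = (rng.uniform(size=200) < sigmoid(2 * X)).astype(float)

    w, lr = np.zeros(2), 0.1
    for _ in range(5000):
        w = w - lr * grad(w, X, y)
    print('learned weights:', w, 'final cost:', bce(w, X, y))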

this is task 4

Task 5:

Understand the K-Nearest Neighbour Algorithm and implement it, first with a built in interface and next, from scratch. Compare the results for both with the indicated datasets.


  1. Import the required libraries.
    • We use a classified dataset: df = pd.read_csv('Classified Data-Copy1', index_col=0)
  2. We need to standardize the variables, since KNN is distance-based and features on larger scales would dominate the distance: from sklearn.preprocessing import StandardScaler

  3. Standardization is the process of calculating the mean and standard deviation for a variable. Then, for each observed value of the variable, you subtract the mean and divide by the standard deviation.

  4. Perform the train-test split as usual.

  5. From sklearn.neighbors, import KNeighborsClassifier and create knn = KNeighborsClassifier(n_neighbors=1), an instance of KNeighborsClassifier.

  6. We fit the model and predict the values for X_test.

  7. We can check the accuracy using print(confusion_matrix(y_test, pred)) and print(classification_report(y_test, pred)), which print the confusion matrix and classification report.

  8. Elbow method to choose the right k-value (a sketch follows this list):

    • Create a for loop to iterate over each k value, appending the error rate each time, to finally get a graph of Error Rate vs. K Value using matplotlib.
    • Our aim is to select the k value that gives the lowest error rate (preferring a larger k when error rates are tied, for stability).
    • We can see that the accuracy has improved from 91% to 94%.
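
A sketch of the full flow including the elbow loop; the 'TARGET CLASS' column name, the k range and random_state are assumptions based on the standard version of this dataset:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    df = pd.read_csv('Classified Data-Copy1', index_col=0)

    # Standardize the features; 'TARGET CLASS' is assumed to be the label column
    scaler = StandardScaler()
    scaled = scaler.fit_transform(df.drop('TARGET CLASS', axis=1))
    X_train, X_test, y_train, y_test = train_test_split(
        scaled, df['TARGET CLASS'], test_size=0.3, random_state=101)

    # Elbow method: record the test error rate for each k
    error_rate = []
    for k in range(1, 40):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        pred = knn.predict(X_test)
        error_rate.append(np.mean(pred != y_test))

    plt.plot(range(1, 40), error_rate, marker='o')
    plt.xlabel('K Value')
    plt.ylabel('Error Rate')
    plt.show()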

this is task 5

Task 6:

  1. Write a blog about your understanding of Neural Networks and types like CNN, ANN, etc. Make sure to include any mathematical implications. You can add the function calls used to implement the algorithms.
  2. Learn about Large Language Models at a basic level and make a blog post explaining how you would build GPT-4.

Neural Networks


Task 7:

Deepen your understanding with proper mathematical constructs that act as backbones for the algorithms you write and implement. Perform the below tasks as indicated:

  • Curve Fitting - Model a curve fit for a simple function of your choice on Desmos.

curve fitting

Task 8: Data Visualization for Exploratory Data Analysis

Use Plotly for data visualization. This is an advanced visualization library, more dynamic than the commonly used Matplotlib or Seaborn.
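
A minimal sketch of what Plotly usage looks like; the built-in iris sample dataset is used purely for illustration:

    import plotly.express as px

    # Plotly ships small sample datasets; iris is convenient here
    df = px.data.iris()

    # An interactive scatter plot: hover, zoom and pan work out of the box
    fig = px.scatter(df, x='sepal_width', y='sepal_length',
                     color='species', title='Iris measurements')
    fig.show()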


this is Plotly

Task 9: An introduction to Decision Trees

Decision Tree is a supervised learning algorithm that can be used for regression or classification tasks. It arranges conditional statements into a hierarchy so that, for an event, you get the chances of the given outcomes.


  1. As usual, import the following:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    
  2. Load the dataset using:
    df_loans = pd.read_csv('loan_data-Copy1.csv')
    
  3. We've made a matplotlib count plot for the column we need, i.e., the not.fully.paid column:
    plt.figure(figsize=(11, 7))
    sns.countplot(x='purpose', hue='not.fully.paid', data=df_loans, palette='Set1')

In the plot, red is 0 (fully paid); otherwise, it is 1 (not fully paid).

  4. We take the categorical purpose column and convert it to classifiable 0/1 dummy variables using:
    transf_data = pd.get_dummies(df_loans, columns=category_feats, drop_first=True)
    
  5. We separate the X and y variables as independent and dependent, as usual:

    X = transf_data.drop('not.fully.paid', axis=1)
    y = transf_data['not.fully.paid']
  6. Here, we train on 70% of the data and test on the remaining 30%, as shown.
  7. From sklearn.tree, we import DecisionTreeClassifier and create dtree as an instance of DecisionTreeClassifier().
  8. We fit the model as dtree.fit(X_train, y_train) and make predictions for X_test.
  9. We check the accuracy using classification_report and confusion_matrix.
  10. The accuracy turns out to be around 75%.
  11. Using a random forest classifier instead, we get around 85% accuracy (a sketch of the comparison follows).
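
A sketch of the comparison described in steps 7-11; category_feats = ['purpose'] and n_estimators are assumptions:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, confusion_matrix

    df_loans = pd.read_csv('loan_data-Copy1.csv')
    category_feats = ['purpose']  # assumed: 'purpose' is the only categorical column
    transf_data = pd.get_dummies(df_loans, columns=category_feats, drop_first=True)

    X = transf_data.drop('not.fully.paid', axis=1)
    y = transf_data['not.fully.paid']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=101)

    # Single decision tree (around 75% accuracy, as reported above)
    dtree = DecisionTreeClassifier()
    dtree.fit(X_train, y_train)
    print(classification_report(y_test, dtree.predict(X_test)))

    # Random forest: an ensemble of trees on bootstrap samples, which
    # usually beats a single tree (around 85% reported above)
    rfc = RandomForestClassifier(n_estimators=200)
    rfc.fit(X_train, y_train)
    print(classification_report(y_test, rfc.predict(X_test)))
    print(confusion_matrix(y_test, rfc.predict(X_test)))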

this is Task 9

Task 10: Exploration of a Real world application of Machine Learning

Ford Case Study"
