AIML Level 2 Report
26 / 10 / 2024
Task 1: Linear and logistic regression - Hello world for AIML
Linear regression finds a straight line that best fits the data, helping you predict future values. For house prices, you provide features like the number of rooms, crime rate, etc., and the model predicts the price. By splitting the data and evaluating the model’s performance, you can understand how well it makes predictions.
Steps to Predict House Prices with Linear Regression:
- Load the Data: Get your house data with features like crime rate, number of rooms, and prices.
- Pick Features: Select important features that affect price, like crime rate, rooms, and age.
- Split the Data: Divide into training (80%) and testing (20%) sets.
- Train the Model: Use the training data to teach the model how features impact prices.
- Make Predictions: Use the model to predict prices on the testing data.
- Evaluate: Check how well the model did using Mean Squared Error (MSE) and R-squared; a sketch of these steps follows below. Link
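A minimal sketch of these steps with scikit-learn (the file and column names here are illustrative, not the exact ones used in the notebook):

```python
# Hypothetical housing data: the CSV path and column names are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("housing.csv")           # load the data
X = df[["crime_rate", "rooms", "age"]]    # pick features
y = df["price"]                           # target

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)               # learn how features impact price

y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
```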
What is Logistic Regression?
Logistic Regression is a statistical method used for binary classification problems, meaning it predicts one of two possible outcomes. However, it can also be extended to multi-class classification (like our Iris flower case) where the goal is to classify data points into more than two categories.
Steps to Solve the Problem: Iris Dataset
- Load the Data: Read the Iris dataset from a CSV file using pandas so we can work with it.
- Understand the Data: Look at the first few rows of the dataset to see what the data looks like, including the features and the target species.
- Prepare the Data:
- Define Features: Select the columns (features) that we will use for prediction, which are the measurements of the flowers (sepal length, sepal width, petal length, and petal width).
- Define Target: Choose the column that contains the species names (the result we want to predict).
- Scale the Features: Adjust the feature values to a similar scale using a technique called standardisation. This helps the model learn better and faster.
- Split the Data: Divide the data into two parts:
- Training Set: About 80% of the data to train the model.
- Test Set: About 20% of the data to test how well the model works.
- Create the Model: Set up the Logistic Regression model using a library like scikit-learn.
- Train the Model: Fit (train) the model on the training set. This is where the model learns from the data.
- Make Predictions: Use the trained model to predict the species of flowers in the test set.
- Evaluate the Model: Check how accurate the predictions are by comparing them to the actual species in the test set. Print out the accuracy and a detailed report to see how well the model performed; a sketch follows below. Link
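A minimal sketch of this pipeline, using scikit-learn's bundled copy of the Iris dataset in place of the CSV so the example is self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

X, y = load_iris(return_X_y=True)     # 4 flower measurements, 3 species

scaler = StandardScaler()             # standardise the features
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)             # train on the training set

y_pred = clf.predict(X_test)          # predict species on the test set
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```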
Task 2: Matplotlib and data visualisation
Multivariate Distribution
Multivariate distribution shows how two or more related things (like height and weight) behave together. It helps us see patterns and relationships between these things in a dataset.
Clustering
Clustering is about putting similar items into groups. For example, if we have different types of flowers, clustering helps us find which ones are alike based on their features, like size or color, without needing to label them first.
Basic Plot Characteristics
- Axes Labels: Use `plt.xlabel()` and `plt.ylabel()`.
- Axes Limits: Use `plt.xlim()` and `plt.ylim()`.
- Multiple Plots: Use `plt.subplot()`.
- Legend: Use `plt.legend()`.
- Save Plot: Use `plt.savefig('filename.png')`.
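A small sketch putting these together on a single plot:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")                 # axes labels
plt.ylabel("y")
plt.xlim(0, 10)                 # axes limits
plt.ylim(-1.5, 1.5)
plt.legend()                    # legend built from the label= arguments
plt.savefig("sine.png")         # save before plt.show()
plt.show()
```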
Plot Types
- Line & Area Plot: `plt.plot()`; `plt.fill_between()`.
- Scatter & Bubble Plot: `plt.scatter()`; vary the `s` (size) argument for bubbles.
- Bar Plot: `plt.bar()` for simple, grouped, and stacked bars.
- Histogram: `plt.hist()`.
- Pie Plot: `plt.pie()`.
- Box Plot: `plt.boxplot()`.
- Violin Plot: `plt.violinplot()`.
- Contour Plot: `plt.contour()`.
- Heatmap: `sns.heatmap()`.
- 3D Plot: Use `mpl_toolkits.mplot3d`.
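For example, two of these plot types placed side by side with `plt.subplot()`:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=500)

plt.subplot(1, 2, 1)                    # 1 row, 2 columns, first panel
plt.scatter(data[:-1], data[1:], s=10)  # s controls marker size
plt.title("Scatter")

plt.subplot(1, 2, 2)                    # second panel
plt.hist(data, bins=30)
plt.title("Histogram")

plt.tight_layout()
plt.show()
```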
Task 3: Numpy
What is NumPy?
- NumPy (Numerical Python) is a library used to handle large, multi-dimensional arrays and matrices.
- It provides mathematical functions to operate on these arrays efficiently, far faster than using Python’s built-in data structures like lists.
`np.tile` for Repeating Arrays
- Small Array: You start with a 2x2 array, `small_array = [[1, 2], [3, 4]]`.
- Repeating: `np.tile(small_array, (2, 3))` repeats this array across dimensions, creating a larger array. Here, `(2, 3)` means the array is repeated 2 times vertically (along the rows) and 3 times horizontally (along the columns). For example:

[[1, 2, 1, 2, 1, 2],
 [3, 4, 3, 4, 3, 4],
 [1, 2, 1, 2, 1, 2],
 [3, 4, 3, 4, 3, 4]]
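The same example in runnable form:

```python
import numpy as np

small_array = np.array([[1, 2], [3, 4]])
tiled = np.tile(small_array, (2, 3))  # repeat 2x along rows, 3x along columns
print(tiled.shape)                    # (4, 6)
print(tiled)
```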
`np.argsort` for Sorting Indices
- Original Array: You define an array `arr = [50, 20, 30, 10, 40]`.
- Sorted Indices: `np.argsort(arr)` returns the indices that would sort the array. It doesn't change the original array but tells you the order of indices that would produce a sorted array.
- In this case, the smallest element (10) is at index 3, so the first value in `sorted_indices` will be 3. This produces the sorted indices: `[3, 1, 2, 4, 0]`.
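And runnable code for `np.argsort`:

```python
import numpy as np

arr = np.array([50, 20, 30, 10, 40])
sorted_indices = np.argsort(arr)   # indices that would sort arr
print(sorted_indices)              # [3 1 2 4 0]
print(arr[sorted_indices])         # [10 20 30 40 50]
```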
Link
Task 4: Metrics and Performance evaluation
Regression metrics are used to evaluate the performance of models that predict continuous values.
Example: California Housing Dataset
The California Housing dataset contains census data from 1990 used to predict housing prices based on features like average income and house age. In regression analysis, Linear Regression models the relationship between the independent variables and the continuous target variable (median house value), and is evaluated using the following metrics:
- Mean Absolute Error (MAE): Measures the average absolute difference between predicted and actual values.
- Mean Squared Error (MSE): Squares the errors, giving more weight to larger mistakes.
- Root Mean Squared Error (RMSE): Provides error on the same scale as the target variable.
- R-squared (R²): Indicates how much variance in the target variable is explained by the model.

Classification metrics evaluate the performance of models that predict discrete class labels.
Example: Wine Dataset
Contains chemical and physical properties of wines from three cultivars, with 13 features like alcohol content and acidity.
- Accuracy: Measures overall correctness of the model.
- Precision: Shows the proportion of true positives among predicted positives.
- Recall: Reflects the model's ability to find actual positive cases.
- F1 Score: Balances precision and recall for better assessment.
- Confusion Matrix: Displays true/false positives and negatives to evaluate classification performance; a sketch computing both sets of metrics follows below. Link
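A minimal sketch computing both families of metrics with scikit-learn (the model choices are illustrative, and `fetch_california_housing` downloads the data on first use):

```python
import numpy as np
from sklearn.datasets import fetch_california_housing, load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# --- Regression metrics on California Housing ---
X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
mse = mean_squared_error(y_te, pred)
print("MAE :", mean_absolute_error(y_te, pred))
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))        # same scale as the target
print("R²  :", r2_score(y_te, pred))

# --- Classification metrics on Wine (3 classes, so macro averaging) ---
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
pred = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict(X_te)
print("Accuracy :", accuracy_score(y_te, pred))
print("Precision:", precision_score(y_te, pred, average="macro"))
print("Recall   :", recall_score(y_te, pred, average="macro"))
print("F1       :", f1_score(y_te, pred, average="macro"))
print(confusion_matrix(y_te, pred))
```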
Task 5: Linear and Logistic Regression - Coding the model from scratch
Linear Regression:
- Goal: Fit a line that best predicts a continuous target variable y from an input x.
- Approach: We use the Gradient Descent algorithm to minimize Mean Squared Error (MSE), which helps in finding the optimal slope and intercept.
- Implementation: We coded a custom LinearRegressionScratch class with functions to:
- fit(): Calculates the slope and intercept based on training data.
- predict(): Uses the learned parameters to make predictions on new data.
- Testing and Plotting: We compare the predictions of this custom model against scikit-learn's Linear Regression using a sample dataset, plotting the results for visual comparison; a sketch of such a class follows below.
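A minimal sketch of what such a class might look like (the hyperparameters and single-feature setup are illustrative, not the report's exact code):

```python
import numpy as np

class LinearRegressionScratch:
    """Simple linear regression (one feature) trained with gradient descent on MSE."""

    def __init__(self, lr=0.05, n_iters=1000):
        self.lr = lr              # learning rate (illustrative value)
        self.n_iters = n_iters
        self.slope = 0.0
        self.intercept = 0.0

    def fit(self, x, y):
        n = len(x)
        for _ in range(self.n_iters):
            y_pred = self.slope * x + self.intercept
            # Gradients of MSE with respect to slope and intercept
            d_slope = (-2 / n) * np.sum(x * (y - y_pred))
            d_intercept = (-2 / n) * np.sum(y - y_pred)
            self.slope -= self.lr * d_slope
            self.intercept -= self.lr * d_intercept

    def predict(self, x):
        return self.slope * x + self.intercept

# Usage on a toy dataset: y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0
model = LinearRegressionScratch()
model.fit(x, y)
print(model.slope, model.intercept)   # approaches 2.0 and 1.0
```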
Logistic Regression:
- Goal: Classify binary outcomes (0 or 1) based on input features.
- Approach: We apply the Sigmoid function to transform the linear output into a probability, making predictions based on a threshold (0.5).
- Implementation: The LogisticRegressionScratch class includes:
- fit(): Uses Gradient Descent to optimize weights and bias.
- predict(): Classifies each data point based on the sigmoid-transformed output.
- Testing and Plotting: We compare the accuracy of this scratch model against scikit-learn's Logistic Regression on a generated binary dataset; a sketch follows below.
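A minimal sketch of such a class, again with illustrative hyperparameters:

```python
import numpy as np

class LogisticRegressionScratch:
    """Binary logistic regression trained with gradient descent."""

    def __init__(self, lr=0.1, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = 0.0

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        for _ in range(self.n_iters):
            p = self._sigmoid(X @ self.weights + self.bias)
            # Gradients of the log-loss with respect to weights and bias
            dw = (1 / n_samples) * (X.T @ (p - y))
            db = (1 / n_samples) * np.sum(p - y)
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict(self, X):
        p = self._sigmoid(X @ self.weights + self.bias)
        return (p >= 0.5).astype(int)   # classify at the 0.5 threshold

# Usage on a tiny generated binary dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = LogisticRegressionScratch()
model.fit(X, y)
print((model.predict(X) == y).mean())   # training accuracy
```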
Link
Task 6: K-Nearest neighbor algorithm
K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification and regression tasks. It is simple, intuitive, and non-parametric, meaning it makes no assumptions about the underlying data distribution.
Key Steps in KNN:
- Storing Training Data: KNN doesn’t learn a model explicitly. Instead, it stores all the training data.
- Calculating Distance: When a new data point (test sample) is given, KNN calculates the distance between this test point and all the points in the training dataset. The most commonly used distance metric is Euclidean distance, but others like Manhattan or Minkowski can also be used.
- Finding Nearest Neighbors: The algorithm selects the k nearest neighbors (training samples) that are closest to the test point based on the calculated distance.
- Majority Voting (for Classification): For classification tasks, KNN assigns the label that is most frequent among the k nearest neighbors. This is called majority voting.
- Averaging (for Regression): For regression tasks, the average of the k nearest neighbors' values is taken as the predicted output. A sketch of the classification case follows below.
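A minimal sketch of the classification case (function and variable names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify one test point by majority vote among its k nearest neighbors."""
    # Euclidean distance from the test point to every stored training point
    distances = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]          # indices of the k closest points
    labels = y_train[nearest]
    return Counter(labels).most_common(1)[0][0]  # majority voting

# Toy usage: two clusters with labels 0 and 1
X_train = np.array([[1, 1], [1, 2], [8, 8], [9, 8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 1])))  # -> 0
```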
Link
Task 7: An elementary step towards understanding Neural Networks
Blog post on Neural networks: Link
Blog post on LLMs and building GPT-4: Link
Task 8: Mathematics behind machine learning
Curve fitting is a process of finding a mathematical function that best approximates a given set of data points. The goal is to model the underlying trend or relationship between variables, typically by minimizing the difference between the predicted values from the function and the actual data points. Link
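As a minimal illustration, fitting a quadratic to noisy data with `scipy.optimize.curve_fit` (the choice of a quadratic model is an assumption for the example):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    return a * x**2 + b * x + c

# Noisy samples of a known quadratic
x = np.linspace(-3, 3, 50)
y = 2.0 * x**2 - 1.0 * x + 0.5 + np.random.normal(0, 0.5, x.size)

params, _ = curve_fit(model, x, y)   # least-squares fit of a, b, c
print(params)                        # close to [2.0, -1.0, 0.5]
```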
Fourier Transforms
- Fourier Transforms are powerful mathematical tools used to analyze and decompose signals into their constituent frequencies. They allow us to represent complex signals in the frequency domain, providing insights into their frequency content. Using techniques like the Fast Fourier Transform (FFT), we can efficiently compute the frequency spectrum of a signal. Link
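A small sketch of recovering the frequencies of a two-tone signal with NumPy's FFT:

```python
import numpy as np

fs = 500                                 # sampling rate (Hz)
t = np.arange(0, 1, 1 / fs)              # 1 second of samples
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.fft.rfft(signal)           # FFT of a real-valued signal
freqs = np.fft.rfftfreq(len(signal), 1 / fs)

# The two largest magnitude peaks sit at the signal's two frequencies
peaks = freqs[np.argsort(np.abs(spectrum))[-2:]]
print(sorted(peaks))                     # [50.0, 120.0]
```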
Task 9: Data visualisation for exploratory data analysis
We use Plotly, a dynamic visualization library, to explore and analyze data. The task involves creating several interactive plots to uncover patterns and insights from the data.
Key Steps:
- Loading Data: We load the Iris dataset, which contains features like sepal length, petal length, etc.
- Scatter Plot: This visualizes the relationship between sepal length and petal length, with species color-coded.
- Histogram: Shows the distribution of sepal width across different species, helping us understand the frequency and spread.
- Box Plot: Highlights the distribution of petal width, showcasing outliers and variability for each species.
- Heatmap: Displays a correlation matrix for numeric features, showing relationships (positive or negative) between them.
- Line Plot: Tracks changes in petal length against sepal length, illustrating trends across species.
Plotly allows these visualizations to be interactive (e.g., zoom, hover), providing a deeper exploration of the data.
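A minimal sketch of two of these plots with Plotly Express, which bundles a copy of the Iris dataset:

```python
import plotly.express as px

df = px.data.iris()   # bundled copy of the Iris dataset

# Scatter: sepal vs petal length, colour-coded by species
px.scatter(df, x="sepal_length", y="petal_length", color="species").show()

# Histogram: distribution of sepal width per species
px.histogram(df, x="sepal_width", color="species", barmode="overlay").show()
```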
Link
Task 10: Decision trees
A Decision Tree is a supervised learning algorithm for regression and classification tasks. It resembles a flowchart that divides a dataset into smaller subsets, where each node represents a test on an attribute (e.g., whether a customer’s age is greater than 30), branches represent outcomes, and leaf nodes indicate class labels or final outcomes.
Key Concepts:
- Root Node: The initial split point based on the most significant feature.
- Decision Nodes: Points where the dataset is split based on conditions.
- Leaf Nodes: End points for final predictions or classifications.
- Splitting: Dividing a node into sub-nodes based on conditions.
- Gini Index/Entropy: Metrics used to assess split quality (purity of nodes).
- Pruning: Removing nodes that contribute little to model accuracy to avoid overfitting.
How Decision Trees Work:
- Training Phase: The tree is built by recursively splitting the dataset based on features, selecting those that enhance performance using Gini Index or Entropy.
- Prediction Phase: To predict outcomes for new data, you follow the tree's conditions until reaching a leaf node, which gives the prediction; a sketch follows below. Link
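A minimal sketch with scikit-learn (the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# criterion can be "gini" (default) or "entropy"; max_depth is a simple
# pre-pruning control against overfitting
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_tr, y_tr)                       # training phase: recursive splitting
print("Test accuracy:", tree.score(X_te, y_te))   # prediction phase
```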
Task 11: SVM
Support Vector Machines (SVM)
SVM is a supervised learning algorithm used for binary classification. It identifies the optimal hyperplane that separates data points from two classes while maximizing the margin, which is the distance between the hyperplane and the nearest data points from each class, known as support vectors.
Steps to Implement SVM for Breast Cancer Detection
- Load the Dataset: Use the Breast Cancer Wisconsin dataset with tumor features and labels (malignant or benign).
- Split the Data: Divide the dataset into training and testing sets.
- Train the SVM Model: Create and train an SVM classifier on the training data.
- Make Predictions: Use the model to predict on the test data and evaluate its accuracy and performance metrics; a sketch follows below. Link
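A minimal sketch of these steps with scikit-learn (the kernel choice is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

X, y = load_breast_cancer(return_X_y=True)   # tumor features, benign/malignant labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = SVC(kernel="linear")                   # find the maximum-margin hyperplane
clf.fit(X_tr, y_tr)

y_pred = clf.predict(X_te)
print("Accuracy:", accuracy_score(y_te, y_pred))
print(classification_report(y_te, y_pred))
```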