26 / 10 / 2024
TASK 01:
Linear and Logistic Regression - Hello World for AIML
Linear Regression
Linear Regression is a statistical method used to model and analyze the relationship between a dependent variable (output) and one or more independent variables (inputs). It aims to find the best-fitting straight line through the data points, allowing for predictions and insights into how changes in the inputs affect the output.
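As a quick illustration, here is a minimal sketch using scikit-learn's `LinearRegression` on a small made-up dataset (the data values are invented purely for demonstration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented example data: hours studied (input) vs. exam score (output)
X = np.array([[1], [2], [3], [4], [5]])   # independent variable
y = np.array([52, 58, 65, 70, 78])        # dependent variable

model = LinearRegression()
model.fit(X, y)                           # fit the best-fitting straight line

print("slope:", model.coef_[0])           # change in score per extra hour
print("intercept:", model.intercept_)     # predicted score at zero hours
print("prediction for 6 hours:", model.predict([[6]])[0])
```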
Logistic Regression
Logistic regression is a statistical method used for binary classification that predicts the probability of a categorical dependent variable based on one or more independent variables. It uses the logistic function to model a binary outcome, where the result is constrained between 0 and 1.
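A corresponding sketch with scikit-learn's `LogisticRegression`, again on invented data, shows the binary-classification counterpart:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented example: hours studied vs. pass (1) / fail (0)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns probabilities constrained between 0 and 1
print(clf.predict_proba([[2.2]])[0, 1])  # probability of passing after 2.2 hours
print(clf.predict([[2.2]]))              # hard class label (0 or 1)
```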
TASK 02: Matplotlib and Data Visualization
Matplotlib is a comprehensive Python library for creating static, animated, and interactive visualizations. Its `pyplot` interface was modeled on the plotting commands of MATLAB (Matrix Laboratory), and it provides functions for line plots, scatter plots, bar charts, histograms, and more, making it a standard choice for data visualization among engineers, scientists, and researchers.
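A minimal plotting sketch with Matplotlib's `pyplot` interface (the data here is arbitrary, chosen only to produce a figure):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

plt.plot(x, np.sin(x), label="sin(x)")                          # line plot
plt.scatter(x[::10], np.cos(x[::10]), label="cos(x) samples")   # scatter plot
plt.xlabel("x")
plt.ylabel("value")
plt.title("Basic Matplotlib example")
plt.legend()
plt.show()
```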
TASK 03: NumPy
NumPy (Numerical Python) is a powerful Python library that provides support for large, multi-dimensional arrays and matrices, along with a vast collection of mathematical functions to operate on these arrays. It's essential for scientific computing, data analysis, and performing complex numerical computations.
1. Generating an Array by Repeating a Small Array Across Dimensions
This task can be efficiently performed using the `np.tile()` function in NumPy. It replicates a small array along specified dimensions, creating a larger matrix or array pattern.
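For example (output shapes shown as comments):

```python
import numpy as np

small = np.array([[1, 2],
                  [3, 4]])

# Repeat the 2x2 block 2 times along rows and 3 times along columns -> 4x6 array
big = np.tile(small, (2, 3))
print(big.shape)   # (4, 6)
print(big)
```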
2. Generating an Array with Element Indexes in Ascending Order
This task requires creating an array of indices ordered sequentially. The `np.arange()` function is ideal for this, allowing you to create an array with values in a specified range.
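For example:

```python
import numpy as np

idx = np.arange(10)          # indexes 0..9 in ascending order
evens = np.arange(0, 20, 2)  # start, stop (exclusive), step

print(idx)    # [0 1 2 3 4 5 6 7 8 9]
print(evens)  # [ 0  2  4  6  8 10 12 14 16 18]
```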
TASK 04: Metrics and Performance Evaluation
1. Regression Metrics
Regression metrics help evaluate the performance of a regression model by comparing predicted values with actual values in a test dataset. Popular choices include Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and the coefficient of determination (R²).
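A small sketch computing these metrics with scikit-learn on made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented actual and predicted values for illustration
y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

print(f"MAE:  {mae:.3f}")
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R^2:  {r2:.3f}")
```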
2. Classification Metrics
Classification metrics are essential for evaluating the performance of classification algorithms. Here’s an overview of some common classification metrics:
- Accuracy: Ratio of correctly predicted instances to the total instances.
- Precision: Ratio of true positive predictions to the total predicted positives.
- Recall (Sensitivity): Ratio of true positive predictions to the total actual positives.
- F1 Score: Harmonic mean of precision and recall.
- ROC AUC: Measures the ability of the model to distinguish between classes.
- Confusion Matrix: Visualizes the performance of a classification model.
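A short sketch computing the metrics above with scikit-learn on invented labels and scores:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Invented true labels, predicted labels, and predicted probabilities
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.2, 0.9, 0.4, 0.3, 0.8, 0.6, 0.7, 0.95])  # P(class = 1)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```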
Task 5: Linear and Logistic Regression - Coding the Model from Scratch
Linear Regression
Linear Regression is one of the most basic and widely used forms of predictive analysis. It is used to predict the value of a dependent variable based on the value of one or more independent variables.
Steps:
Understanding Loss Function:
- Utilize Mean Squared Error (MSE) as the chosen loss function for implementation.
- MSE computes the mean of the squared differences between predicted and true values.
Optimization Algorithm:
- Employ Gradient Descent as the optimization algorithm to find optimal parameter values.
- Gradient Descent iteratively updates parameters to minimize the loss function.
Objective:
- Find the optimal slope (m) and intercept (b) values that minimize the MSE.
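A minimal from-scratch sketch of these steps (the synthetic data, learning rate, and iteration count are illustrative choices, not the original code):

```python
import numpy as np

# Invented 1-D data: y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 1, 100)

m, b = 0.0, 0.0          # slope and intercept, initialised to zero
lr = 0.01                # learning rate
n = len(x)

for _ in range(1000):
    y_pred = m * x + b
    error = y_pred - y
    # Gradients of MSE = mean((y_pred - y)^2) with respect to m and b
    dm = (2 / n) * np.sum(error * x)
    db = (2 / n) * np.sum(error)
    m -= lr * dm
    b -= lr * db

mse = np.mean((m * x + b - y) ** 2)
print(f"m = {m:.3f}, b = {b:.3f}, MSE = {mse:.3f}")
```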
Logistic Regression
Logistic regression is a supervised machine learning algorithm used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class or not.
- Sigmoid Function: The sigmoid function converts the linear combination of inputs into a probability between 0 and 1.
- Training the Model: In this step, the model learns the optimal weights and bias using gradient descent to minimize the loss function.
- Prediction: After training, the model predicts the class labels based on the input features.
- Generate Synthetic Dataset: We generate a synthetic dataset for demonstration purposes.
- Train the Model: We train the logistic regression model on the synthetic dataset.
- Predict and Calculate Accuracy: We predict the labels for the training data and calculate the accuracy of the model.
- Plot Decision Boundary: We plot the decision boundary to visualize how the model separates the two classes in the feature space.
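A compact from-scratch sketch of these steps in plain NumPy (the dataset, learning rate, and iteration count are illustrative; the decision-boundary plot is omitted for brevity):

```python
import numpy as np
from sklearn.datasets import make_blobs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Generate a synthetic two-class dataset
X, y = make_blobs(n_samples=200, centers=2, random_state=42)

w = np.zeros(X.shape[1])   # weights
b = 0.0                    # bias
lr = 0.1

# Training: gradient descent on the binary cross-entropy loss
for _ in range(1000):
    p = sigmoid(X @ w + b)            # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

# Prediction and accuracy on the training data
pred = (sigmoid(X @ w + b) >= 0.5).astype(int)
print("training accuracy:", np.mean(pred == y))
```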
TASK 06: K-Nearest Neighbor Algorithm
Task
- Understand the KNN Algorithm: Learn the basics of how KNN works.
- Implement KNN from Scratch: Write a custom version of the KNN algorithm.
- Use scikit-learn’s KNeighborsClassifier: Test scikit-learn’s built-in KNN on multiple datasets and compare the results to the custom model.
Procedure
- Define KNN Class: Create a `KNN` class with methods for training and predicting.
- Implement Euclidean Distance: Write a `euclidean_distance()` method to measure distances between data points.
- Fit the Model: The `fit()` method stores the training data within the class for later use.
- Predict Method: The `predict()` method makes predictions for a set of samples, using `_predict()` for each one.
- Predict Single Sample: The `_predict()` method finds the `k` closest neighbors for one sample and returns the most common label.
- Use Scikit-learn's KNN: Load datasets (Iris, Digits, Wine), and split them into training and testing sets.
- Train and Evaluate: Train both the custom KNN and scikit-learn's `KNeighborsClassifier` on each dataset. Compare their accuracy on the test data.
- Print Results: Display accuracy scores for both models on each dataset for easy comparison.
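A condensed sketch of the custom `KNN` class and the comparison against `KNeighborsClassifier`, shown here on the Iris dataset only (the value of `k` and the split are illustrative):

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Lazy learner: simply store the training data
        self.X_train, self.y_train = X, y

    def euclidean_distance(self, a, b):
        return np.sqrt(np.sum((a - b) ** 2))

    def _predict(self, x):
        # Distances to every training point, then majority vote among k nearest
        dists = [self.euclidean_distance(x, xt) for xt in self.X_train]
        k_idx = np.argsort(dists)[:self.k]
        k_labels = self.y_train[k_idx]
        return Counter(k_labels).most_common(1)[0][0]

    def predict(self, X):
        return np.array([self._predict(x) for x in X])

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

custom = KNN(k=3)
custom.fit(X_tr, y_tr)
sk = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)

print("custom KNN accuracy :", np.mean(custom.predict(X_te) == y_te))
print("sklearn KNN accuracy:", sk.score(X_te, y_te))
```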
TASK 07: An Elementary Step Towards Understanding Neural Networks
Neural Networks
Neural networks are computational models inspired by the human brain, consisting of interconnected nodes (neurons) organized in layers that process data and learn from it to perform tasks like classification, regression, and pattern recognition.
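As an elementary illustration, the forward pass of a tiny two-layer network can be written directly in NumPy (the layer sizes and random weights are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# One input sample with 4 features
x = rng.normal(size=4)

# Layer 1: 4 inputs -> 3 hidden neurons, ReLU activation
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
h = np.maximum(0, W1 @ x + b1)

# Layer 2: 3 hidden -> 2 output neurons, softmax for class probabilities
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
logits = W2 @ h + b2
probs = np.exp(logits) / np.sum(np.exp(logits))

print("class probabilities:", probs)
```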
Large Language Models (LLMs)
Large Language Models (LLMs) are advanced AI systems that understand and generate human-like text, enabling various natural language processing tasks.
TASK 08: Mathematics Behind Machine Learning
Curve Fitting
Curve fitting finds a mathematical function that best describes the pattern of a set of data points, helping us understand trends and make predictions.
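A minimal sketch using `scipy.optimize.curve_fit` to fit an assumed model to noisy, invented data:

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed model: an exponential decay y = a * exp(-b * x) + c
def model(x, a, b, c):
    return a * np.exp(-b * x) + c

# Invented noisy data generated from known parameters
rng = np.random.default_rng(1)
x = np.linspace(0, 4, 50)
y = model(x, 2.5, 1.3, 0.5) + rng.normal(0, 0.05, x.size)

params, _ = curve_fit(model, x, y)
print("fitted a, b, c:", params)   # should be close to 2.5, 1.3, 0.5
```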
Fourier Transforms
The Fourier Transform helps analyze periodic signals by representing them as a sum of sine and cosine waves, which allows us to extract frequency components from the time-domain signal.
Time-Domain Graph
- The x-axis shows time from 0 to 1 second, sampled at 1000 points (one per millisecond).
- The y-axis shows the signal's strength, ranging from about -0.5 to +0.5.
- The graph displays a wavy line representing a sine wave that completes 50 cycles in 1 second.
- There are random spikes in the wave caused by noise, making it look less smooth.
- The overall graph mixes the smooth sine wave pattern with these random noise variations.
- This helps us see both the regular wave and the randomness in the signal.
Single-Sided Amplitude Spectrum of a Signal
- What It Is: This spectrum displays how strong each frequency is in the signal, showing only the positive frequencies.
- Signal Analysis: It helps us understand which frequencies are present in the signal after performing the Fourier Transform.
- Positive Frequencies: The spectrum includes only the positive frequencies, as negative frequencies are redundant for real signals.
- Amplitude Strength: Higher values on the spectrum mean that those frequencies contribute more to the signal's overall shape.
- X-Axis (Frequency): The horizontal axis shows frequency in hertz (Hz), from 0 to half the sampling rate (known as the Nyquist frequency).
- Y-Axis (Amplitude): The vertical axis shows how strong each frequency is, derived from the Fast Fourier Transform (FFT) calculations.
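A sketch reproducing the plots described above, assuming a 50 Hz sine wave of amplitude 0.5 sampled at 1000 Hz for 1 second with added Gaussian noise:

```python
import numpy as np
import matplotlib.pyplot as plt

fs = 1000                        # sampling rate (Hz) -> one sample per millisecond
t = np.arange(0, 1, 1 / fs)      # 1 second of time values
signal = 0.5 * np.sin(2 * np.pi * 50 * t)          # 50 Hz sine wave
noisy = signal + 0.05 * np.random.randn(t.size)    # add random noise

# Time-domain graph
plt.figure()
plt.plot(t, noisy)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Noisy 50 Hz sine wave")

# Single-sided amplitude spectrum via the FFT
N = t.size
fft_vals = np.fft.fft(noisy)
freqs = np.fft.fftfreq(N, d=1 / fs)[:N // 2]        # positive frequencies only
amplitude = 2.0 / N * np.abs(fft_vals[:N // 2])     # scale to single-sided amplitude

plt.figure()
plt.plot(freqs, amplitude)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Amplitude")
plt.title("Single-sided amplitude spectrum")
plt.show()
```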
TASK 09: Data Visualization for Exploratory Data Analysis
Data visualization for exploratory data analysis (EDA) is the process of using graphical representations to explore, analyze, and understand datasets. It involves creating visual formats such as charts, graphs, and plots to uncover patterns, trends, relationships, and insights within the data. EDA helps identify anomalies, inform hypotheses, and guide further statistical analysis, enabling more informed decisions in later modelling and analysis.
- Understand the Data: Familiarize yourself with the dataset’s structure and variable types.
- Preprocess the Data: Clean the dataset by handling missing values and correcting data types.
- Choose Visualization Tools: Select libraries/tools like Matplotlib, Seaborn, or Tableau for visualizations.
- Visualize Univariate Distributions: Use histograms, box plots, and bar charts for individual variables.
- Visualize Bivariate Relationships: Create scatter plots and heatmaps to explore relationships between two variables.
- Visualize Multivariate Relationships: Use pair plots and facet grids to examine interactions among multiple variables.
- Identify Patterns and Insights: Analyze visualizations for trends, correlations, and anomalies.
- Document Findings: Summarize key insights for communication and further analysis.
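A brief sketch of these steps using Seaborn's built-in `tips` dataset (chosen only for illustration; loading it requires an internet connection):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Load a small example dataset and inspect its structure
tips = sns.load_dataset("tips")
print(tips.info())
print(tips.isna().sum())          # check for missing values

# Univariate distribution: histogram of total bill
sns.histplot(tips["total_bill"], bins=20)
plt.show()

# Bivariate relationship: scatter plot of bill vs. tip
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.show()

# Correlation heatmap of numeric variables
sns.heatmap(tips.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()

# Multivariate view: pair plot of numeric columns
sns.pairplot(tips, hue="sex")
plt.show()
```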
TASK 10: An Introduction to Decision Trees
What is a Decision Tree?
A decision tree is a tool used to make decisions based on data. It looks like a tree, with a starting point (root), branches (choices), and end points (decisions).
- Root: The starting point of the tree, where the first question about the data is asked.
- Branches: The paths that represent different answers to the question.
- Leaf Nodes: The final outcomes or decisions at the end of the branches.
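A minimal sketch training scikit-learn's `DecisionTreeClassifier` on the Iris dataset (the depth limit is an arbitrary illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_tr, y_tr)

print("test accuracy:", tree.score(X_te, y_te))
# Print the learned tree: the root question, branches, and leaf decisions
print(export_text(tree, feature_names=load_iris().feature_names))
```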
Task 11: Support Vector Machines (SVM)
Support Vector Machines (SVM) are supervised learning methods that build a non-probabilistic binary classifier: each data point is assigned to one of two classes by the decision boundary that maximizes the separation (margin) between the classes.
Data Representation
- The data points are represented as vectors in a multi-dimensional space. Each dimension corresponds to a feature of the data, allowing SVM to work effectively in high-dimensional datasets.
Hyperplane
- A hyperplane is a decision boundary separating different classes. SVM aims to find the optimal hyperplane that maximizes the margin between the two classes.
Support Vectors
- Support vectors are the data points closest to the hyperplane. They are critical in determining the position and orientation of the hyperplane. The optimal hyperplane is defined by these support vectors.
Kernel Trick
- The kernel trick allows SVM to handle non-linearly separable data by transforming the data into a higher-dimensional space where a linear separation is possible. Common kernels include polynomial, radial basis function (RBF), and sigmoid.
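A short sketch illustrating the margin and the kernel trick with scikit-learn's `SVC` (the dataset, kernels, and regularization value are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A non-linearly separable two-class dataset
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Linear kernel vs. RBF kernel (the kernel trick handles the curved boundary)
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0)
    clf.fit(X_tr, y_tr)
    print(f"{kernel:>6} kernel: test accuracy = {clf.score(X_te, y_te):.3f}, "
          f"support vectors = {clf.support_vectors_.shape[0]}")
```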