
AIML Level-1 Report

28 / 7 / 2024


Task 1 - Linear and Logistic Regression - HelloWorld for AIML

Linear Regression:

Linear regression is a machine learning technique that predicts a numerical outcome by finding the best-fitting straight line through a set of data points, representing the relationship between one or more input variables and the target variable.

Predicted the price of a home based on multiple variables through the following steps (a sketch follows the list):

1. Load the required libraries

2. Load the Boston Housing dataset from the original source

3. Prepare the data and target variables

4. Create a DataFrame for better visualization

5. Compute the statistics

6. Print the count and mean

7. Split the data into training and testing sets

8. Initialize the linear regression model

9. Fit the model on the training data

10. Predict on the test data

11. Print the model's performance
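A minimal sketch of these steps, assuming the dataset is fetched from the original CMU StatLib source (the feature names follow the dataset's documentation):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the Boston Housing dataset from the original source
# (the dataset was removed from scikit-learn, so it is read directly)
url = "http://lib.stat.cmu.edu/datasets/boston"
raw = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw.values[::2, :], raw.values[1::2, :2]])
target = raw.values[1::2, 2]

# DataFrame for better visualization and summary statistics
features = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
            "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
df = pd.DataFrame(data, columns=features)
df["MEDV"] = target
print(df.describe().loc[["count", "mean"]])  # count and mean

# Split, fit, predict, and report performance
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```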


Logistic Regression:

Logistic regression is a machine learning model used for binary classification tasks. It predicts the probability that a given input belongs to one of two categories by modeling the relationship between the input features and the outcome using a logistic function, which outputs values between 0 and 1.

Trained a model to distinguish between different species of the Iris flower based on sepal length, sepal width, petal length, and petal width, as follows (see the sketch after the list):

1. Load the necessary libraries

2. Load the Iris dataset

3. Split the data into train and test sets

4. Create and train the logistic regression model

5. Make predictions

6. Evaluate model performance

7. Plot decision boundaries
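A minimal sketch, using only the two sepal features so the decision boundaries can be drawn in 2D:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset and keep sepal length and sepal width
X, y = load_iris(return_X_y=True)
X = X[:, :2]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Plot decision boundaries by predicting over a mesh grid
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
plt.show()
```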


Task 2 - Matplotlib and Data Visualisation

1. Explored the various basic characteristics of plots, as given below, using Python libraries (demonstrated in the sketch after this task's lists):

• Set axes labels

• Set axes limits

• Create a figure with multiple plots using subplot

• Add a legend to the plot

• Save the plot as a PNG

2. Explored the given plot types:

• Line and Area Plot

• Scatter and Bubble Plot using the Iris dataset

• Bar Plot

1. Simple

2. Grouped

3. Stacked

4. Histogram

• Pie Plot

• Box Plot

• Violin Plot

• Marginal Plot

• Contour Plot

• Heatmap

• 3D Plot

Also plotted the multivariate distribution of the Iris dataset for a classification task.
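A minimal sketch of the basic plot characteristics from the first list (axes labels, axes limits, subplots, a legend, and saving to PNG):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# A figure with multiple plots using subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(x, np.sin(x), label="sin(x)")
ax1.plot(x, np.cos(x), label="cos(x)")
ax1.set_xlabel("x")                 # axes labels
ax1.set_ylabel("y")
ax1.set_xlim(0, 2 * np.pi)          # axes limits
ax1.set_ylim(-1.5, 1.5)
ax1.legend()                        # legend

ax2.fill_between(x, np.sin(x), alpha=0.4)   # area plot
ax2.set_title("Area under sin(x)")

fig.savefig("plots.png")            # save the figure as a PNG
plt.show()
```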


Task 3 - NumPy

NumPy is a powerful Python library used in machine learning and data science for numerical computing. It provides support for working with large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.

Using NumPy, generated an array by repeating a small array across each dimension, and generated an array of element indexes such that the array elements appear in ascending order, as in the sketch below.
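A minimal sketch, assuming np.tile for the repetition and np.argsort for the ascending-order indexes:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# Repeat the small array across each dimension:
# 2 times along rows, 3 times along columns
tiled = np.tile(a, (2, 3))
print(tiled)

# Indexes that would place the elements in ascending order
b = np.array([30, 10, 20])
order = np.argsort(b)
print(order)     # [1 2 0]
print(b[order])  # [10 20 30]
```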


Task 4 - Metrics and Performance Evaluation

1. Regression Metrics:

Regression metrics are used to evaluate the performance of regression models, which predict continuous outcomes. These metrics help measure how well the model's predictions match the actual values; some simple and commonly used ones are computed below.
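A minimal sketch using scikit-learn on made-up values, covering MAE, MSE, RMSE, and R²:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative actual and predicted values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.5])

print("MAE :", mean_absolute_error(y_true, y_pred))  # average absolute error
mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)                                  # average squared error
print("RMSE:", np.sqrt(mse))                         # same units as the target
print("R^2 :", r2_score(y_true, y_pred))             # fraction of variance explained
```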


2. Classification Metrics:

Classification metrics are used to evaluate the performance of classification models, which predict categorical outcomes. These metrics help determine how well the model's predictions match the actual classes.

Here is an example computing common classification metrics:
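A minimal sketch on made-up binary labels, covering accuracy, precision, recall, F1, and the confusion matrix:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Illustrative actual and predicted classes
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```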

Task 6 - K-Nearest Neighbor Algorithm

1. Implementation of KNN

K-Nearest Neighbors (KNN) is a straightforward supervised learning algorithm used for both classification and regression. It works by identifying the 'k' closest data points to a query point based on a distance metric, often Euclidean distance. For classification, it assigns the most common label among these neighbors, while for regression, it averages the values. KNN is non-parametric, meaning it makes no assumptions about the underlying data distribution, but it can be computationally intensive with large datasets. Its performance depends on the choice of 'k' and the distance metric, which can affect its accuracy.

Here is an example implementation of KNN using scikit-learn's neighbors module:
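A minimal sketch with scikit-learn's KNeighborsClassifier (the Iris dataset here is an assumed example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# k = 5 neighbors with Euclidean distance (the defaults)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```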

2. Understanding the algorithm:

Here are some important points to understand K-Nearest Neighbors (KNN) in machine learning:

Instance-Based Learning:

KNN is a lazy learning algorithm, meaning it doesn't explicitly train a model but instead stores the training data and uses it for prediction during testing.

Distance Metric:

The algorithm relies on a distance function (commonly Euclidean, Manhattan, or Minkowski) to find the 'k' nearest neighbors to the input point.

Choice of 'k':

The value of 'k' (number of neighbors) significantly affects the model's performance. A small 'k' may lead to overfitting (sensitive to noise), while a large 'k' may over-smooth the decision boundary (underfitting).

Non-Parametric:

KNN makes no assumptions about the underlying data distribution, making it flexible for various types of data but potentially inefficient with large datasets.

Computation Cost:

Since KNN stores and checks all data points during prediction, it can be computationally expensive for large datasets, making optimization techniques like KD-trees or Ball trees important in practice.

Scaling and Normalization:

KNN is sensitive to the scale of features, so it is crucial to normalize or standardize data before applying the algorithm.

Applications:

KNN is widely used for classification tasks, like handwriting recognition and image classification, but can also be applied to regression problems.

3. Implemented KNN from scratch and compared it with scikit-learn's built-in method on different datasets, as sketched below.
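A minimal from-scratch sketch (brute-force Euclidean distances plus a majority vote), compared against scikit-learn on Iris as one illustrative dataset:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_predict(X_train, y_train, X_test, k=5):
    """Brute-force KNN classifier."""
    preds = []
    for x in X_test:
        # Euclidean distance from x to every training point
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # Labels of the k nearest neighbors
        nearest = y_train[np.argsort(dists)[:k]]
        # Majority vote
        preds.append(Counter(nearest).most_common(1)[0][0])
    return np.array(preds)

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

ours = knn_predict(X_tr, y_tr, X_te, k=5)
sk = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).predict(X_te)
print("Agreement with scikit-learn:", (ours == sk).mean())
```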


Task 8: Mathematics behind Machine Learning

Curve-Fitting:

Curve fitting is the process of finding a mathematical function that best fits a set of data points. In machine learning, this concept is used in algorithms like linear regression to model the relationship between input and output data. By optimizing parameters through methods like least squares, we can minimize the error between predicted and actual values. This demonstrates that machine learning relies on mathematical techniques to learn patterns and make accurate predictions.

Curve fitting for the function y = mx + b
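A minimal sketch, fitting m and b by least squares with SciPy's curve_fit on synthetic data (the true values m = 2, b = 1 are made up for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, m, b):
    return m * x + b

# Noisy synthetic data around y = 2x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 1, x.size)

# Least-squares estimates of m and b
(m, b), _ = curve_fit(line, x, y)
print(f"m = {m:.3f}, b = {b:.3f}")
```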


Fourier Transforms:

The Fourier series is important in machine learning for breaking down complex, repeating data into simple sine and cosine waves. This makes it easier to find patterns and process signals, as in image analysis or time-series forecasting. By converting data into the frequency domain, machine learning models can handle noise and identify key patterns more effectively.
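A minimal sketch of moving a signal into the frequency domain with NumPy's FFT (the 5 Hz and 12 Hz components are made up for illustration):

```python
import numpy as np

# A 1-second signal: 5 Hz and 12 Hz sine waves plus noise
t = np.linspace(0, 1, 500, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)
signal += 0.2 * np.random.randn(t.size)

# Convert to the frequency domain
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])

# The dominant frequencies stand out as the largest peaks
peaks = freqs[np.argsort(np.abs(spectrum))[-2:]]
print("Dominant frequencies (Hz):", sorted(peaks))
```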


Task 9: Data Visualization for Exploratory Data Analysis

This task used a plotting library that is more advanced and dynamic than the generally used Matplotlib or Seaborn. Data visualization is essential for Exploratory Data Analysis (EDA) as it reveals data patterns, distributions, and relationships. Techniques like histograms, scatter plots, and box plots help in understanding data and identifying trends. Effective visualizations simplify insights and guide further analysis.

Here is an example of data visualization for exploratory data analysis.
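A minimal sketch, assuming the dynamic library in question is Plotly (which matches the description above):

```python
import plotly.express as px

# Interactive scatter matrix of the Iris dataset for EDA
df = px.data.iris()
fig = px.scatter_matrix(
    df,
    dimensions=["sepal_length", "sepal_width",
                "petal_length", "petal_width"],
    color="species")
fig.show()
```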

Task 10: An Introduction to Decision Trees

A decision tree in machine learning is a predictive model that makes decisions by splitting data into subsets based on the values of input features. Each internal node represents a feature-based decision, branches represent possible outcomes, and leaf nodes provide the final prediction. It's used for both classification and regression tasks, and is valued for its simplicity and interpretability.
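A minimal sketch on the Iris dataset (an assumed example), training a shallow tree and drawing its nodes and leaves:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))

# Each internal node tests one feature; each leaf gives the prediction
plot_tree(tree, filled=True)
plt.show()
```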


Task 11: SVM

Support vector machines are supervised learning methods that create a non-probabilistic linear model: each data point, treated as a vector, is assigned to one of two classes, and a hyperplane between the two classes is chosen to maximize the margin between them while regularizing the loss.

1. Understanding Support Vector Machines:

a. Hyperplane: SVMs find the optimal hyperplane that separates classes.

b. Margin Maximization: They maximize the margin between classes.

c. Support Vectors: Support vectors are critical data points closest to the hyperplane.

d. Kernel Trick: Kernels allow handling non-linearly separable data.

e. Regularization Parameter (C): Controls the trade-off between margin size and classification error.

f. Soft Margin vs. Hard Margin: Soft margins allow some misclassifications to handle noisy data.

2. Using the concept of support vector machines, detected the possibility of breast cancer (a sketch follows the considerations below).

Important Considerations:

Cross-Validation:

Use k-fold cross-validation to ensure that the model generalizes well to unseen data.

Kernel Selection:

If linear separation isn't possible, RBF or polynomial kernels may work better.

Data Imbalance:

Ensure that the dataset is balanced or use techniques like SMOTE if there is a class imbalance.

This method can help in the early diagnosis of breast cancer, allowing medical professionals to intervene with timely treatment plans.
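A minimal sketch, assuming scikit-learn's built-in Wisconsin breast-cancer dataset, with feature scaling, an RBF kernel, and k-fold cross-validation as discussed above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scale the features, then fit an RBF-kernel SVM;
# C controls the margin-size vs. misclassification trade-off
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# 5-fold cross-validation to check generalization
scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy:", scores.mean())

model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```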

