LEVEL-2 REPORT
24 / 10 / 2024
Task 1: Linear and Logistic Regression
Linear Regression
Linear regression is a fundamental statistical method used for modeling the relationship between a dependent variable and one or more independent variables. It assumes that this relationship is linear, meaning that changes in the independent variables lead to proportional changes in the dependent variable. The model aims to find the best-fit line that minimizes the sum of squared differences between predicted and actual values. Linear regression is widely applied in various fields, including economics, finance, and science, to make predictions, understand correlations, and infer causal relationships. It serves as a basis for more complex regression techniques and is a valuable tool in data analysis and machine learning.
GitHub: https://github.com/umeshsolanki2005/Linear-Regression.git
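To make the idea concrete, here is a minimal sketch (not the repository's exact code) that fits an ordinary least-squares line with scikit-learn; the synthetic data, true coefficients, and seed are assumptions chosen for the demo.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # one independent variable
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1, 100)   # linear relation plus noise

model = LinearRegression()                          # ordinary least squares
model.fit(X, y)                                     # minimizes the sum of squared residuals
print(model.coef_, model.intercept_)                # should be close to 3.0 and 5.0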
Logistic Regression
The task involves training a logistic regression model using scikit-learn's linear_model.LogisticRegression to classify different species of the Iris flower based on four key features: sepal length, sepal width, petal length, and petal width. The Iris dataset, a widely used dataset in machine learning, contains measurements of these features for three species of the flower: Iris-setosa, Iris-versicolor, and Iris-virginica. The objective is to build a model that accurately distinguishes between these species by fitting a logistic regression algorithm, which is ideal for classification problems. This task includes data preprocessing, splitting the dataset into training and testing sets, training the logistic regression model, and evaluating its performance through metrics such as accuracy, precision, and recall.
GitHub: https://github.com/umeshsolanki2005/logistic-regression.git
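A condensed sketch of the workflow described above might look as follows; the 80/20 split, random seed, and max_iter value are illustrative assumptions rather than the repository's settings.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

X, y = load_iris(return_X_y=True)                   # four features, three species
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=200)              # a higher max_iter helps convergence
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred, average='macro'))  # macro-average over the 3 classes
print(recall_score(y_test, y_pred, average='macro'))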
Task 2 - Matplotlib and Data Visualization
In this task, data visualization is performed on the BigMart dataset using Matplotlib to explore and understand the underlying patterns in the data. Various types of plots are employed to visualize key relationships between features such as item type, outlet size, sales, and visibility. Bar plots are used to illustrate the distribution of item sales across different outlet types and sizes, providing insight into which stores perform better. Histograms reveal the distribution of sales and item visibility, helping to identify any skewness or outliers. Box plots are applied to examine the variation in sales across different item types, detecting potential trends or anomalies. Additionally, scatter plots help visualize the correlation between item visibility and sales, uncovering any possible linear or non-linear relationships. These visualizations offer a comprehensive understanding of the data, aiding in feature selection and model building for predicting sales.
GitHub: https://github.com/umeshsolanki2005/Visualization-using-matplotlib.git
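The sketch below shows how the four plot types could be drawn with Matplotlib; the file name Train.csv and the column names (Outlet_Type, Item_Outlet_Sales, Item_Type, Item_Visibility) are assumed from the standard BigMart schema and may differ from the repository.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Train.csv')                       # assumed file name for the BigMart data

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Bar plot: mean sales per outlet type
df.groupby('Outlet_Type')['Item_Outlet_Sales'].mean().plot.bar(ax=axes[0, 0])

# Histogram: distribution of sales (reveals right skew)
axes[0, 1].hist(df['Item_Outlet_Sales'], bins=50)

# Box plot: sales variation across item types
df.boxplot(column='Item_Outlet_Sales', by='Item_Type', ax=axes[1, 0], rot=90)

# Scatter plot: item visibility vs sales
axes[1, 1].scatter(df['Item_Visibility'], df['Item_Outlet_Sales'], s=5)

plt.tight_layout()
plt.show()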
Task 3 - NumPy
The task involves working with arrays using NumPy, a fundamental library in Python for numerical computing. Arrays in NumPy are multidimensional data structures, allowing efficient operations on large datasets. One key feature is the ability to manipulate arrays and generate new ones by repeating smaller arrays across different dimensions. For instance, using functions like np.tile or np.repeat, a small array can be repeated along specified axes to form larger arrays, creating patterns or structures that expand the original array's shape. Another aspect of this task involves generating arrays that contain the element indexes in ascending order, which can be achieved using functions like np.arange or np.indices. These functions allow for creating arrays with elements arranged sequentially, which is useful in various indexing or sorting operations in data analysis tasks.
GitHub: https://github.com/umeshsolanki2005/Numpy.git
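The short demo below shows these functions in action on a small 2x2 array.

import numpy as np

a = np.array([[1, 2], [3, 4]])

print(np.tile(a, (2, 3)))           # repeat the whole 2x2 block: result is 4x6
print(np.repeat(a, 2, axis=0))      # repeat each row twice: result is 4x2

print(np.arange(12).reshape(3, 4))  # elements 0..11 in ascending order
print(np.indices((2, 3)))           # row and column index grids for a 2x3 array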
Task 4 - Metrics and Performance Evaluation
In the context of the Diabetes dataset using logistic regression, key metrics from scikit-learn's metrics package include mean absolute error, mean squared error, root mean squared error, and R-squared. These metrics provide insight into the model's performance by quantifying how far its predictions fall from the actual outcomes for diabetic (positive class) and non-diabetic (negative class) cases.
For the Titanic dataset using a classification model (a decision tree classifier), metrics from the same package help assess the model's ability to predict survival. Accuracy measures overall correctness, precision focuses on the proportion of true survivors among all positive predictions, and recall captures the proportion of actual survivors correctly identified. The F1 score balances precision and recall.
GitHub: https://github.com/umeshsolanki2005/Metric_and_Performance.git
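As an illustration of these metrics (not the repository's code), the snippet below computes the regression-style errors and the classification scores on a tiny set of hypothetical labels and predictions.

import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             accuracy_score, precision_score, recall_score, f1_score)

# Regression-style error metrics, e.g. on predicted scores or probabilities
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.4, 0.1, 0.9])
mse = mean_squared_error(y_true, y_prob)
print(mean_absolute_error(y_true, y_prob), mse, np.sqrt(mse))   # MAE, MSE, RMSE

# Classification metrics, e.g. for survival predictions
y_pred = (y_prob >= 0.5).astype(int)                            # threshold at 0.5
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))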
Task 5 - Linear and Logistic Regression - Coding the model from SCRATCH
Linear Regression from scratch
Linear regression is a fundamental statistical method used to model the relationship between a dependent variable (target) and one or more independent variables (features). Implementing linear regression from scratch involves calculating a line of best fit, typically expressed by the equation y = mx + b, where m is the slope and b is the y-intercept. The goal is to minimize the difference between predicted values and actual values, which is done by minimizing a cost function like Mean Squared Error (MSE) using optimization techniques such as Gradient Descent. By iteratively updating the model's parameters, the line converges to the best fit that captures the linear relationship between the variables.
GitHub: https://github.com/umeshsolanki2005/Linear_regression_from_scratch.git
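A minimal from-scratch sketch of this procedure is shown below; the synthetic data, learning rate, and iteration count are assumptions for the demo, not the repository's exact settings.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2.5 * x + 1.0 + rng.normal(0, 1, 200)   # noisy line with known slope/intercept

m, b = 0.0, 0.0
lr = 0.01                                   # learning rate
for _ in range(2000):
    error = (m * x + b) - y                 # predicted minus actual
    dm = 2 * np.mean(error * x)             # dMSE/dm
    db = 2 * np.mean(error)                 # dMSE/db
    m -= lr * dm                            # gradient descent update
    b -= lr * db

print(m, b)                                 # converges near 2.5 and 1.0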
Logistic Regression from scratch
Logistic regression from scratch on the breast cancer dataset from scikit-learn involves implementing the model without using built-in libraries for the algorithm itself, focusing instead on the underlying mathematical principles. Logistic regression is a linear model used for binary classification, where the sigmoid function maps predicted values to probabilities. By calculating the gradient of the cost function (log-loss) with respect to the model's weights, gradient descent can be employed to iteratively update these weights and minimize the error. The breast cancer dataset, whose 30 features describe cell nuclei (such as mean radius and texture) and whose labels indicate whether a tumor is malignant or benign, is well suited for testing this implementation.
GitHub: https://github.com/umeshsolanki2005/Logistic_Regression_from_scratch.git
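The core of such an implementation might look like the sketch below; the learning rate, iteration count, and standardization step are illustrative choices rather than the repository's exact code.

import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
X = (X - X.mean(axis=0)) / X.std(axis=0)    # scaling keeps gradient descent stable

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1
for _ in range(1000):
    z = X @ w + b
    p = 1 / (1 + np.exp(-z))                # sigmoid maps scores to probabilities
    grad_w = X.T @ (p - y) / len(y)         # gradient of the log-loss w.r.t. weights
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(np.mean((p >= 0.5) == y))             # training accuracy, typically well above 0.9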
Task 6 - K-Nearest Neighbor Algorithm
In this code, I implement the k-Nearest Neighbors (KNN) algorithm from scratch to classify the Iris flower dataset. I define a custom Euclidean distance function to calculate the distance between data points and a predict_knn function to predict the class labels of test instances based on their nearest neighbors in the training set. The Iris dataset, which contains features such as sepal length, sepal width, petal length, and petal width, is split into training and testing sets. I set the number of neighbors (k=7) for the KNN model and compare the custom implementation with scikit-learn's KNeighborsClassifier. Both approaches yield the same accuracy score of 96.67%, demonstrating that my implementation correctly replicates the behavior of the KNN model. I also visualize the Iris data using a scatter plot with seaborn, distinguishing between different species.
GitHub: https://github.com/umeshsolanki2005/-K--Nearest-Neighbor-Algorithm.git
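The heart of the approach can be sketched as follows; predict_knn matches the function named above, while the euclidean helper name and the exact code are illustrative.

import numpy as np
from collections import Counter

def euclidean(a, b):
    # straight-line distance between two feature vectors
    return np.sqrt(np.sum((a - b) ** 2))

def predict_knn(X_train, y_train, x, k=7):
    dists = [euclidean(x, xt) for xt in X_train]
    nearest = np.argsort(dists)[:k]             # indices of the k closest training points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]           # majority class among the neighbors

# usage on an Iris-style split: y_hat = [predict_knn(X_train, y_train, x) for x in X_test]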
Task 7: An elementary step towards understanding Neural Networks
A neural network is a model inspired by the brain, made up of layers of interconnected neurons. It processes data, learns patterns, and adjusts its weights during training to improve accuracy. There are different types: Artificial Neural Networks (ANNs) for general tasks, Convolutional Neural Networks (CNNs) for image data, and Recurrent Neural Networks (RNNs) for sequential data like time series. Each is suited for different applications, making neural networks versatile in solving a variety of problems.
Article: https://hub.uvcemarvel.in/article/f779ec6e-afa9-46d3-827e-3b5f94b5303a
Large language models (LLMs), like GPT and BERT, represent a breakthrough in how computers understand and generate human language. Using transformer architectures, which process entire sentences at once, LLMs can grasp the context of words more effectively than previous models like RNNs. Trained on massive text datasets through unsupervised learning, these models learn to predict words and understand language structure. While LLMs excel at tasks like translation and question-answering, they require significant computational power and can produce biased outputs due to the data they are trained on.
Article: https://hub.uvcemarvel.in/article/3239bb3d-5c57-4ddc-9ec1-e19f0c7ecd4e
Task 8: Mathematics behind machine learning
I gained a basic understanding of curve fitting using Desmos, which allows me to visualize and model relationships in data by adjusting functions until they best fit the given data points. I also learned to apply the Fourier transform in MATLAB, which provides a powerful way to analyze functions by transforming them from the time domain to the frequency domain. This is essential for understanding how different frequency components contribute to a signal, enhancing my analytical capabilities in both data interpretation and signal processing.
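Although the task itself used MATLAB, the same workflow can be sketched in Python with NumPy's FFT routines; the sampling rate and test signal below are assumptions for the demo.

import numpy as np

fs = 1000                                    # sampling rate in Hz
t = np.arange(0, 1, 1 / fs)                  # one second of samples
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.fft.rfft(signal)               # FFT of a real-valued signal
freqs = np.fft.rfftfreq(len(signal), 1 / fs)

peaks = freqs[np.argsort(np.abs(spectrum))[-2:]]
print(sorted(peaks))                         # ~[50.0, 120.0], the two frequency components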
Task 9: Data Visualization for Exploratory Data Analysis
I have learned to work with Plotly for a variety of visualization tasks. I began by creating simple line plots and scatter plots, progressing to more advanced visualizations like 2D contour histograms and scatter plot combinations. I explored pie charts for categorical breakdowns, bar plots to compare data groups, and even 3D scatter plots for visualizing multidimensional data. Additionally, I used Plotly for analyzing datasets, such as visualizing PUBG gun stats and stock data for Apple. I also created a choropleth map to display population data for Washington state, utilizing Plotly's interactive mapping capabilities.
GitHub: https://github.com/umeshsolanki2005/Plotly.git
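A few of these chart types can be reproduced with plotly.express, as sketched below on Plotly's built-in Iris sample rather than the datasets used in the task.

import plotly.express as px

df = px.data.iris()                                          # built-in sample dataset

px.line(df, y='sepal_length').show()                         # simple line plot
px.scatter(df, x='sepal_width', y='sepal_length',
           color='species').show()                           # 2D scatter plot

species_counts = df['species'].value_counts()                # categorical breakdown
px.pie(names=species_counts.index, values=species_counts.values).show()

px.scatter_3d(df, x='sepal_length', y='sepal_width',
              z='petal_length', color='species').show()      # 3D scatter plot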
Task 10: An introduction to Decision Trees
In this task I have implemented a machine learning model using a Decision Tree to predict student placement based on their degree, skills, and programming language knowledge. The dataset (data.csv) contains four columns: "Degree," "Skills," "programming_lang," and "Placed" (where Placed is the target variable indicating whether a student was placed or not). The categorical features ("Degree," "Skills," and "programming_lang") are first converted into numerical form using LabelEncoder, creating new columns (degree_n, Skills_n, lang_n). The original columns are then dropped. A DecisionTreeClassifier from scikit-learn is used to train the model with the processed inputs and target labels. Finally, the trained model is tested by predicting the placement outcome for a student with a B.Tech degree, DSA skills, and Python programming language, resulting in a positive prediction (1), indicating the student is likely to be placed.
GitHub: https://github.com/umeshsolanki2005/Decision-Tree.git
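The pipeline can be sketched as follows; it uses the column names given above, while the exact category strings (e.g. 'B.Tech', 'DSA', 'Python') must match those actually present in data.csv.

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('data.csv')

le_degree, le_skills, le_lang = LabelEncoder(), LabelEncoder(), LabelEncoder()
df['degree_n'] = le_degree.fit_transform(df['Degree'])       # categorical -> numeric
df['Skills_n'] = le_skills.fit_transform(df['Skills'])
df['lang_n'] = le_lang.fit_transform(df['programming_lang'])

X = df[['degree_n', 'Skills_n', 'lang_n']]
y = df['Placed']

model = DecisionTreeClassifier()
model.fit(X, y)

# Predict for a B.Tech / DSA / Python student via the fitted encoders
sample = pd.DataFrame([[le_degree.transform(['B.Tech'])[0],
                        le_skills.transform(['DSA'])[0],
                        le_lang.transform(['Python'])[0]]],
                      columns=['degree_n', 'Skills_n', 'lang_n'])
print(model.predict(sample))                                 # 1 -> likely to be placed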
Task 11: SVM
In this task I learned how to use Support Vector Machines (SVM) and applied this knowledge by writing a model to detect breast cancer using the popular scikit-learn library. In the model, I loaded the breast cancer dataset, split it into training and testing sets, and applied feature scaling to normalize the data. I trained the SVM model with a linear kernel, though it could be customized with other kernels such as RBF or polynomial. After training, I evaluated the model's performance using the accuracy score, confusion matrix, and classification report. The model achieved good accuracy, demonstrating SVM's strength in binary classification tasks.
GitHub: https://github.com/umeshsolanki2005/Support-Vector-Machines.git
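A sketch of the workflow under the steps described (the split ratio and seed are assumed):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)      # fit the scaling on training data only
X_test = scaler.transform(X_test)

clf = SVC(kernel='linear')                   # swap in 'rbf' or 'poly' to experiment
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))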