
BLOG · 3/3/2024

Sahar Mariam's Level-2 Task Report (part-2)

sahar mariam

Task 5: Random Forest, GBM and XGBoost
------------------------------------------------------------------
_Understanding Random Forests:_
- An ensemble learning method that belongs to the bagging family.
- Builds multiple decision trees during training and merges their outputs to obtain a more accurate and stable prediction.
- Each decision tree in the forest is built on a random subset of the training data, and randomness is introduced both in the data used for training and in the features considered at each split.

Algorithm:
- Bootstrapped Sampling: For each tree in the forest, a random subset of the training data is sampled with replacement. This process is known as bootstrapped sampling, and it creates multiple diverse training datasets.

- Random Feature Selection: At each node of a decision tree, only a random subset of features is considered for splitting. This introduces randomness, helps prevent the dominance of a few features, and makes the algorithm more robust.

- Decision Tree Construction: A decision tree is constructed for each bootstrapped sample using the selected features. The tree is grown recursively by repeatedly splitting nodes on the feature that provides the best split according to a specified criterion (e.g., Gini impurity for classification, mean squared error for regression).

- Ensemble Building: The above steps are repeated to create a forest of decision trees. Each tree is built independently, and the randomness introduced in the sampling and feature selection ensures diversity among the trees.

- Voting or Averaging: For classification tasks, the final prediction is usually determined by a majority vote among the trees. For regression tasks, the predictions of all the trees are averaged. This ensemble approach improves the overall accuracy and robustness of the model.

- Feature Importance Calculation: After training the random forest, the importance of each feature is calculated based on how much it contributed to the accuracy of the model. This can be measured by the decrease in impurity (for classification) or the decrease in mean squared error (for regression) caused by each feature.

- Prediction: Given a new input, each tree in the forest makes a prediction, and the final prediction is obtained through the voting or averaging process.

_Implementing Random Forests_: https://github.com/sahar-mariam/level2-report/blob/main/random_forest.ipynb

_Understanding GBM:_
- GBM (Gradient Boosting Machine) is an ML algorithm that builds a predictive model as an ensemble of weak learners (decision trees).
- Used for both classification and regression tasks.
- Weak learners are decision trees with shallow depth; they give simple predictions, and these are combined to create a stronger predictive model.
- In GBMs, each new tree is trained to correct the errors made by the combination of the previous trees.
- The algorithm minimizes a predefined loss function by adjusting the contribution of each weak learner in the ensemble. It uses the gradient of the loss function with respect to the predicted values to guide the training process.
- Popular implementations of gradient boosting include XGBoost, LightGBM, and scikit-learn's GradientBoostingRegressor/GradientBoostingClassifier.
- Gradient Boosting Machines are powerful and widely used for a variety of machine learning tasks due to their ability to handle complex relationships in data and produce high-quality predictions.

_Implementing GBM_: https://github.com/sahar-mariam/level2-report/blob/main/gbm.ipynb

_Understanding XGBoost:_
XGBoost (eXtreme Gradient Boosting) is an ML library that implements the gradient boosting algorithm. It is an ensemble learning algorithm that belongs to the family of gradient boosting machines.
- Developed for speed and performance.
- Used for supervised learning tasks (classification and regression).
- Efficient, flexible and accurate.

Features/Characteristics:
- A type of ensemble technique.
- Handles both categorical and numerical data.
- Prevents overfitting using regularization techniques.
- Uses tree pruning to control the depth of each tree, which also helps prevent overfitting.

_Implementing XGBoost_: https://github.com/sahar-mariam/level2-report/blob/main/xgboost.ipynb
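To make the three ensembles concrete, here is a minimal sketch (separate from the notebooks linked above) that trains a random forest, a scikit-learn GBM and an XGBoost classifier on the same data and compares test accuracy. It assumes scikit-learn and the xgboost package are installed; the hyperparameter values are illustrative only.

```python
# Minimal comparison of Random Forest, GBM and XGBoost on one dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier  # assumes the xgboost package is installed

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    # Bagging: many trees on bootstrapped samples, majority vote.
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    # Boosting: shallow trees added sequentially to correct previous errors.
    "GBM": GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                      max_depth=3, random_state=42),
    # XGBoost: regularized, optimized gradient boosting implementation.
    "XGBoost": XGBClassifier(n_estimators=200, learning_rate=0.1,
                             max_depth=3, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {acc:.3f}")
```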
--------------------------------------------------------------------
Task 6: Hyperparameter Tuning
-----------------------------------------------------------------

Hyperparameter tuning is the process of finding the optimal set of hyperparameters for a machine learning model to achieve better performance.

Hyperparameters are external configurations that are not learned from the data but are set before the training process begins.

Features of Hyperparameter Tuning:

1. External Configurations:
- Parameters vs. Hyperparameters: Parameters are internal variables learned from the data during training (e.g., weights in a neural network). Hyperparameters are external configurations that must be set before training (e.g., learning rate, number of layers).

2. Impact on Model Performance:
- Critical Influence: Hyperparameters significantly impact a model's performance, and finding the right values is crucial for achieving optimal results.
- Model Flexibility: They control the flexibility or complexity of a model. For example, a higher value of the regularization hyperparameter might lead to a simpler model.

3. Trial and Error Process:
- Search for Optimal Values: Hyperparameter tuning involves a search for the combination of hyperparameter values that gives the best model performance.
- Iterative Process: It is often an iterative process of training, evaluating, and adjusting hyperparameters.

Types/Approaches of Hyperparameter Tuning:

1. Grid Search:
- Definition: Exhaustively searches a predefined hyperparameter space.
- Process:
  - Define a set of hyperparameter values to test (a grid).
  - Train the model with each combination of hyperparameter values.
  - Select the combination that yields the best performance.

2. Random Search:
- Definition: Randomly samples from a hyperparameter space.
- Process:
  - Randomly select combinations of hyperparameter values.
  - Train the model with each sampled combination.
  - Evaluate and select the combination with the best performance.
- Advantage: More efficient than grid search when only a small fraction of the hyperparameter space is fruitful.

The choice of tuning method depends on factors such as the size of the hyperparameter space, computational resources, and the nature of the objective function. The iterative process of adjusting hyperparameters based on model performance is crucial for achieving the best possible results.

Application of Hyperparameter Tuning: https://github.com/sahar-mariam/level2-report/blob/main/hyperparameter_tuning.ipynb
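As a concrete illustration (a minimal sketch, not the linked notebook), the snippet below tunes a random forest with scikit-learn's GridSearchCV and RandomizedSearchCV; the parameter grid is an arbitrary example chosen for brevity.

```python
# Grid search vs. random search over a small Random Forest parameter space.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}

# Grid search: tries every combination in the grid (3 * 3 * 3 = 27 settings).
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)
print("Grid search best params:", grid.best_params_)
print("Grid search best CV accuracy:", round(grid.best_score_, 3))

# Random search: samples a fixed number of combinations from the same space.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, n_iter=10, cv=5,
                          scoring="accuracy", random_state=42)
rand.fit(X, y)
print("Random search best params:", rand.best_params_)
print("Random search best CV accuracy:", round(rand.best_score_, 3))
```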
-------------------------------------------------------------------------------

Task 7: Image Classification using KMeans Clustering
---------------------------------------------------------------
1. Understanding K-Means Clustering:
   Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It is the task of identifying sub-groups in the data such that data points in the same sub-group (cluster) are very similar, while data points in different clusters are very different.
   K-means clustering is an unsupervised learning technique used when we have unlabelled data. The main goal of the algorithm is to divide the data points in a dataset into different categories or groups, where points are grouped together based on their similarities. K-means tries to partition the dataset into k clusters by optimizing an objective function.

   K-means clustering algorithm:
   1. Choose the value of k, i.e. the number of clusters to be formed for the given dataset.
   2. Randomly select k data points from the dataset as the initial cluster centroids.
   3. For each data point in the dataset, compute the distance between the point and each cluster centroid.
   4. Assign each data point to the closest centroid and update the centroids by recomputing them from the newly assigned points.
   5. Repeat steps 3 and 4 until the assignments stop changing (or a maximum number of iterations is reached).

   In an image classification problem we have to classify a given set of images into a given number of categories. Training data is available in a classification problem, but when there is no labelled training data we can use clustering to group similar images together.

   The steps involved in image classification with clustering are:
   1. The images to be classified are imported and converted into arrays.
   2. Clusters are created. The clusters appear in the resulting image, distinctly dividing the image into a number of parts.
   3. The number of clusters can be changed to visually validate the image with different colours and decide what closely matches the required number of clusters.
   4. Once the clusters are formed, the image can be recreated from the cluster centres and labels to display the grouped patterns.

2. Classify a given set of images into a given number of categories using KMeans clustering on the MNIST dataset:
   https://github.com/sahar-mariam/level2-report/blob/main/image_classification_kmeansclustering.ipynb
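A minimal sketch of this idea is shown below. It uses scikit-learn's small built-in digits dataset instead of the full MNIST download (an assumption made here for brevity): the images are flattened into arrays, grouped into 10 clusters with KMeans, and each cluster is then mapped to its most common true digit only to check how well the unsupervised grouping matches the real categories.

```python
# Group digit images into 10 clusters with KMeans (no labels used for fitting).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

digits = load_digits()                 # 8x8 grayscale digit images
X = digits.data                        # each image flattened to a 64-dim array
y = digits.target                      # true labels, used only for evaluation

kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

# Map each cluster to the most common true digit inside it, then measure how
# often that mapping agrees with the real label.
predicted = np.zeros_like(cluster_ids)
for c in range(10):
    mask = cluster_ids == c
    predicted[mask] = np.bincount(y[mask]).argmax()

print("Cluster-to-digit agreement:", round((predicted == y).mean(), 3))
```

The cluster centres (`kmeans.cluster_centers_`) can also be reshaped back to 8x8 and displayed to visualise the "average image" of each group, which is the recreation step described above.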
----------------------------------------------------------

Task 8: SVMs
----------------------------------------------------------
_Understanding SVMs:_

In machine learning, Support Vector Machines (SVMs) are a type of supervised learning algorithm used for classification and regression tasks.
Given two groups of data in a dataset, an SVM draws a decision boundary (the best separating line, plane or hyperplane) between the groups in n-dimensional space, which is then used to place new data points into their respective groups.
The points from each group that lie nearest to the decision boundary are called support vectors.
The boundary is chosen so that it separates the classes while keeping the largest possible margin between itself and these nearest points.

_Some types of SVMs:_
1. Linear SVM: Used for linearly separable data, i.e. data that can be separated by a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions). It is commonly used for binary classification.

2. Non-linear SVM: Used when the data is not linearly separable, meaning a straight line or plane cannot effectively separate the classes. Non-linear SVMs use the kernel trick to map the data into a higher-dimensional space where it can be linearly separated.

3. Support Vector Regression (SVR): While traditional SVMs are used for classification tasks, SVR is used for regression tasks. Instead of predicting class labels, SVR predicts a continuous value, making it suitable for tasks like predicting house prices or stock prices.

4. One-Class SVM: Used for novelty or anomaly detection. Instead of separating data into multiple classes, a one-class SVM separates the majority of the data from rare outliers or anomalies.

In the linked notebook:
- We load the Breast Cancer dataset from scikit-learn and create a DataFrame.
- The data is split into features (X) and the target (y), followed by a further split into training and testing sets.
- We create an SVM classifier with a linear kernel and a regularization parameter (C) of 1.0.
- The classifier is trained on the training data.
- We plot the decision boundary using two specific features, 'mean radius' and 'mean texture', as a 2D representation.
- Data points are plotted on the graph, colour-coded by their true labels (malignant or benign).

Implementing SVMs: https://github.com/sahar-mariam/level2-report/blob/main/SVM_breastcancer_detection.ipynb
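The following is a minimal sketch along the lines of the steps above (the linked notebook may differ in its details): it trains a linear-kernel SVM with C = 1.0 on the two named features and reports test accuracy; the 2D decision-boundary plot could be added with matplotlib.

```python
# Linear SVM on two features of the Breast Cancer dataset.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target                      # 0 = malignant, 1 = benign

# Use only the two features that the 2D decision-boundary plot is based on.
X = df[["mean radius", "mean texture"]]
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = SVC(kernel="linear", C=1.0)               # linear kernel, C = 1.0
clf.fit(X_train, y_train)

print("Test accuracy:", round(accuracy_score(y_test, clf.predict(X_test)), 3))
print("Support vectors per class:", clf.n_support_)
```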
-------------------------------------------------

Task 9: Anomaly Detection
-------------------------------------------------
Anomaly detection is a way to detect erroneous or unusual data points in a stream by looking at statistical differences.
- Anomalies are abnormal/unusual changes in the patterns of a dataset when compared to the rest of the dataset.
- Analysing and catching/detecting these anomalies by observing patterns statistically or visually is called anomaly detection.
- An anomaly does not always indicate an error, but results may differ because of it.

Anomalies can be broadly classified into different categories:
- Point Anomalies:
  Definition: A single data point is considered anomalous.
  Example: A very high or very low value that stands out from the rest of the data.

- Collective Anomalies:
  Definition: A set of related data points together forms an anomaly.
  Example: Unusual patterns or behaviours that emerge when considering a group of data points collectively.

- Global Anomalies:
  Definition: An anomaly with respect to the overall dataset.
  Example: A sudden and significant change in the overall distribution of the data.

_Applications of Anomaly Detection in ML:_
1. Disease-specific anomaly detection
2. Cybersecurity: detecting network attacks
3. Defect detection in manufacturing, packaging and construction

Multiple machine learning algorithms can be used for anomaly detection depending on the dataset size and the type of problem:
- Local Outlier Factor (LOF)
- DBSCAN
- Isolation Forest
- Autoencoders
- Bayesian networks

Some commonly used techniques for anomaly detection in machine learning:
- Unsupervised Anomaly Detection:
  - Artificial Neural Networks (ANNs)
  - Isolation Forest
  - Gaussian Mixture Models (GMM)

- Supervised Anomaly Detection:
  - Support Vector Machines (SVM)
  - Random Forests
  - KNN

- Semi-Supervised Anomaly Detection:
  - Pre-trained models
  - Transfer learning

_Implementing Anomaly Detection:_
https://github.com/sahar-mariam/level2-report/blob/main/Anomaly_Detection_BreastCancer.ipynb
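As a small illustration (a sketch only; the linked notebook may use a different technique), the snippet below applies an Isolation Forest to the Breast Cancer features and flags the most isolated samples as anomalies. The contamination value is an assumed fraction of anomalies, not a measured one.

```python
# Unsupervised anomaly detection with Isolation Forest on the Breast Cancer data.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

X, y = load_breast_cancer(return_X_y=True)

# contamination is the assumed fraction of anomalies in the data (a guess here).
iso = IsolationForest(n_estimators=200, contamination=0.05, random_state=42)
labels = iso.fit_predict(X)            # +1 = normal, -1 = anomaly

n_anomalies = int((labels == -1).sum())
print(f"Flagged {n_anomalies} of {len(X)} samples as anomalies")

# For curiosity only (the model never saw the labels): which true classes
# do the flagged points belong to?
values, counts = np.unique(y[labels == -1], return_counts=True)
print("True class counts among flagged points:",
      dict(zip(values.tolist(), counts.tolist())))
```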
