Keerthi's AI-ML-001 Coursework (Level 3)


2 / 2 / 2026


Task 1: Decision Tree Using the ID3 Algorithm

In this task, I implemented a Decision Tree Classifier using the ID3 algorithm, which selects attributes based on Information Gain (IG) to build the tree. The impurity of the dataset was measured using Entropy, defined as:

$$\text{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$$

where $p_i$ is the probability of class $i$. At each step, the Information Gain of an attribute $A$ was calculated as:

$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$

By recursively choosing the feature with the maximum gain, the dataset was split into purer subsets, resulting in an effective decision tree model. Link ID3
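
To make the two formulas concrete, here is a minimal sketch of how entropy and information gain can be computed for candidate splits. The toy columns ("Outlook", "Windy", "Play") are hypothetical illustrations, not the dataset used in the report.

```python
# Sketch of ID3's attribute-selection step: compute Entropy(S) and Gain(S, A).
import numpy as np
import pandas as pd

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class distribution."""
    probs = labels.value_counts(normalize=True)
    return -np.sum(probs * np.log2(probs))

def information_gain(df, attribute, target):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)."""
    total_entropy = entropy(df[target])
    weighted = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(attribute)
    )
    return total_entropy - weighted

# Hypothetical toy data: pick the attribute with the highest gain to split on first.
data = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain"],
    "Windy":   ["False", "True", "False", "False", "True"],
    "Play":    ["No", "No", "Yes", "Yes", "No"],
})
gains = {a: information_gain(data, a, "Play") for a in ["Outlook", "Windy"]}
print(gains)  # ID3 splits on the attribute with the maximum information gain
```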

Task 2 - Naive Bayesian Classifier

The Naive Bayes Classifier is a simple yet powerful probabilistic model based on Bayes' Theorem, assuming conditional independence among features. It is widely applied in text classification, spam filtering, and sentiment analysis due to its efficiency with large datasets. In the implementation, GaussianNB was used for the Iris dataset, as it handles continuous numerical features by assuming a normal distribution, while MultinomialNB was applied for spam detection since it works well with discrete features like word counts. This demonstrates the flexibility of Naive Bayes across different domains, making it a practical and effective classification algorithm. Link naive
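
A minimal sketch of the GaussianNB half of this task on the built-in Iris data; the split ratio and random seed here are my choices, not necessarily those used in the report.

```python
# GaussianNB on Iris: continuous features modelled with a per-class Gaussian likelihood.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB()
model.fit(X_train, y_train)
print("Iris accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

For the spam-detection part, the same pattern applies with MultinomialNB fitted on word-count features instead of the continuous measurements above.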

Task 3 - Ensemble Techniques

In this task, the goal was to predict whether a passenger survived the Titanic disaster.
I first handled missing values in age, fare, and embarked, then converted categorical data such as sex into numeric form and created useful features such as familySize, alone, and binned categories for age and fare.

For the model, I used a Stacking Classifier where Decision Tree and Random Forest worked as base models and Logistic Regression combined their results.

This approach improved the prediction accuracy and showed how feature preparation and combining models can give better results. Link pic
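
A minimal sketch of this stacking setup, assuming the Titanic data is loaded through seaborn for convenience; the preprocessing shown here is a simplified version of the feature engineering described above (the report's binned age/fare categories are omitted).

```python
# Stacking: Decision Tree + Random Forest as base models, Logistic Regression as the meta-model.
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = sns.load_dataset("titanic")
df["age"] = df["age"].fillna(df["age"].median())
df["fare"] = df["fare"].fillna(df["fare"].median())
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])
df["sex"] = df["sex"].map({"male": 0, "female": 1})
df["familySize"] = df["sibsp"] + df["parch"] + 1
df["alone"] = (df["familySize"] == 1).astype(int)

features = ["pclass", "sex", "age", "fare", "familySize", "alone"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["survived"], test_size=0.2, random_state=42
)

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # combines the base models' predictions
)
stack.fit(X_train, y_train)
print("Stacking accuracy:", stack.score(X_test, y_test))
```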

Task 4 - Random Forest, GBM and Xgboost

1. Random Forest

Random Forest is an ensemble learning method used for classification and regression. It works by building multiple decision trees on different subsets of data and combining their predictions to improve accuracy and reduce overfitting. In this task, I used a Random Forest classifier to predict heart disease. I split the dataset into training (70%) and testing (30%) sets, trained an initial model, and then applied GridSearchCV to tune hyperparameters like max_depth, n_estimators, and min_samples_leaf. The optimized model achieved an improved accuracy of 0.699. I also visualized some decision trees and analyzed feature importance, finding that age, sex, cholesterol, and blood pressure were the key predictors. Link
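
A minimal sketch of the tuning workflow described above; the file name "heart.csv" and its "target" column are assumptions about the heart-disease dataset, and the grid values are illustrative rather than the exact ranges searched in the report.

```python
# Random Forest with GridSearchCV over max_depth, n_estimators and min_samples_leaf.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("heart.csv")                      # hypothetical path to the heart-disease data
X, y = df.drop(columns="target"), df["target"]     # "target" column name is an assumption
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 3, 5],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

best_rf = grid.best_estimator_
print("Best params:", grid.best_params_)
print("Test accuracy:", best_rf.score(X_test, y_test))

# Feature importance, used to identify the key predictors mentioned above
importances = pd.Series(best_rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head())
```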

2. Gradient Boosting

Gradient Boosting is a powerful machine learning technique that builds decision trees one after another, with each new tree learning from the mistakes of the previous ones. This way, the model gradually becomes more accurate and can capture complex patterns in the data.

For this task, I used a Gradient Boosting Classifier to predict breast cancer. I split the dataset into 80% training and 20% testing data. The model was trained with 100 trees, a learning rate of 0.1, and a maximum depth of 2, which balanced accuracy and complexity. On the test set, it achieved an accuracy of 95%, performing very well for both classes. I also looked at feature importance and visualized the top 10 features, which helped identify the most influential factors in the predictions. Link 1
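
A minimal sketch using the settings quoted above (100 trees, learning rate 0.1, max depth 2) on scikit-learn's built-in breast-cancer dataset; the split seed is my choice.

```python
# Gradient Boosting: trees fitted sequentially, each correcting the previous ensemble's errors.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=2, random_state=42)
gbm.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, gbm.predict(X_test)))

# Top-10 most influential features, as examined in the report
top10 = pd.Series(gbm.feature_importances_, index=data.feature_names).nlargest(10)
print(top10)
```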

3. XGBoost

XGBoost is an optimized gradient boosting algorithm designed for speed and performance. It builds an ensemble of decision trees sequentially, where each new tree tries to correct the errors of the previous ones, improving predictive accuracy and handling complex data patterns efficiently. In this task, I applied an XGBoost Classifier to the Pima Indians Diabetes dataset, splitting it into training (67%) and testing (33%) sets. The model was trained on the training data and tested on the test set, achieving an accuracy of 72.44%. Link 2
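
A minimal sketch of this setup with the 67/33 split mentioned above; the CSV file name and the assumption that the last column holds the 0/1 outcome are mine, since the report's exact data file is not shown.

```python
# XGBoost classifier on the Pima Indians Diabetes data (file layout assumed).
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("pima-indians-diabetes.csv")         # hypothetical path to the dataset
X, y = df.iloc[:, :-1], df.iloc[:, -1]                # assume last column is the 0/1 outcome
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)

model = XGBClassifier(n_estimators=100, eval_metric="logloss")
model.fit(X_train, y_train)
print("Accuracy: %.2f%%" % (accuracy_score(y_test, model.predict(X_test)) * 100))
```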

Task 5 - Hyperparameter Tuning

Hyperparameter tuning helps improve a model's performance by optimizing parameters that are not learned from data. I trained Logistic Regression and Decision Tree models on a classification dataset. For Logistic Regression, I used Grid Search to find the best regularization parameter. For the Decision Tree, I applied Randomized Search to tune parameters like max_depth, min_samples_leaf, and criterion. The models were evaluated using cross-validation, which improved their predictive accuracy and reliability. Link 3
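
A minimal sketch of the two searches described above, run here on a synthetic dataset; the actual dataset and parameter ranges in the report may differ.

```python
# Grid Search for Logistic Regression's C, Randomized Search for Decision Tree hyperparameters.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Grid Search over the regularization strength C
log_grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
log_grid.fit(X, y)
print("Best C:", log_grid.best_params_, "CV accuracy:", round(log_grid.best_score_, 3))

# Randomized Search over Decision Tree hyperparameters
tree_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    {
        "max_depth": [3, 5, 10, None],
        "min_samples_leaf": [1, 2, 5, 10],
        "criterion": ["gini", "entropy"],
    },
    n_iter=10,
    cv=5,
    random_state=42,
)
tree_search.fit(X, y)
print("Best tree params:", tree_search.best_params_, "CV accuracy:", round(tree_search.best_score_, 3))
```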

Task 6: Image Classification using KMeans Clustering

Image classification is a key application of machine learning that assigns labels to images based on their content. I applied KMeans Clustering on the MNIST dataset to group similar digit images into clusters. Each image was flattened and normalized, and the model was trained to find optimal cluster centroids. Helper functions then inferred a digit label for each cluster, so the clusters could be mapped to actual digits and evaluated for accuracy. Finally, I visualized the centroids as images and tested the model on unseen data, achieving strong classification performance. Link 4
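
A minimal sketch of the cluster-then-label idea; to keep the example light it uses scikit-learn's small 8x8 digits set as a stand-in for the full MNIST images used in the report.

```python
# KMeans on digit images: flatten/normalize, cluster, then map each cluster to a digit label.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score

digits = load_digits()
X = digits.data / 16.0                       # flatten + normalize pixel values to [0, 1]
y = digits.target

kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)

# Infer a digit label for each cluster from the majority true label inside it
labels = np.zeros_like(clusters)
for c in range(10):
    mask = clusters == c
    labels[mask] = np.bincount(y[mask]).argmax()

print("Clustering accuracy:", accuracy_score(y, labels))
# kmeans.cluster_centers_.reshape(10, 8, 8) can be plotted to visualize the centroids as images
```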

Task 7: Anomaly Detection

Anomaly detection identifies unusual or erroneous data points in a dataset by analyzing statistical differences. I applied a One-Class SVM on the Iris dataset, using sepal length and width as features. The data was scaled to standardize the features before training the model. The model learned to separate normal points from anomalies, and predictions were made for the entire dataset. Finally, the results were visualized, highlighting normal points in green and anomalies in red, showing the effectiveness of the approach. Link 6
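
A minimal sketch of this setup: a One-Class SVM on the first two Iris features (sepal length and width) after standard scaling. The nu value, which caps the expected fraction of anomalies, is my assumption.

```python
# One-Class SVM on Iris sepal features: +1 predictions are normal, -1 are anomalies.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

X = load_iris().data[:, :2]                        # sepal length, sepal width
X_scaled = StandardScaler().fit_transform(X)       # standardize before training

ocsvm = OneClassSVM(kernel="rbf", gamma="auto", nu=0.05)
pred = ocsvm.fit_predict(X_scaled)

colors = ["green" if p == 1 else "red" for p in pred]
plt.scatter(X[:, 0], X[:, 1], c=colors)
plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
plt.title("One-Class SVM: normal (green) vs anomalous (red) points")
plt.show()
```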

Task 8: Generative AI Task Using GAN

Generative Adversarial Networks (GANs) are models that can generate realistic synthetic data by training two networks, the Generator and the Discriminator, in opposition. I implemented a GAN using PyTorch to generate images from the CIFAR-10 dataset. The Generator creates fake images from random noise, while the Discriminator learns to distinguish real from fake images. Both networks were trained together over multiple epochs to improve image realism. Finally, I visualized generated images, showing the GAN's ability to produce diverse and high-quality synthetic images. Link 8
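
A minimal sketch of the adversarial training loop on CIFAR-10, using simple fully connected Generator and Discriminator networks; the architecture, latent size, and hyperparameters here are simplifications of my own, not the exact model from the report.

```python
# Tiny GAN sketch: G maps noise -> flattened images, D scores images as real/fake.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
latent_dim, img_dim = 100, 3 * 32 * 32

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,) * 3, (0.5,) * 3)])
loader = DataLoader(datasets.CIFAR10("data", train=True, download=True, transform=transform),
                    batch_size=128, shuffle=True)

# Generator: noise -> flattened 32x32 RGB image in [-1, 1]
G = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                  nn.Linear(512, 1024), nn.ReLU(),
                  nn.Linear(1024, img_dim), nn.Tanh()).to(device)

# Discriminator: flattened image -> probability that it is real
D = nn.Sequential(nn.Linear(img_dim, 512), nn.LeakyReLU(0.2),
                  nn.Linear(512, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid()).to(device)

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()

for epoch in range(5):                                    # more epochs improve realism
    for real, _ in loader:
        real = real.view(real.size(0), -1).to(device)
        ones = torch.ones(real.size(0), 1, device=device)
        zeros = torch.zeros(real.size(0), 1, device=device)

        # Train Discriminator: push real images toward 1 and generated images toward 0
        noise = torch.randn(real.size(0), latent_dim, device=device)
        fake = G(noise)
        d_loss = loss_fn(D(real), ones) + loss_fn(D(fake.detach()), zeros)
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # Train Generator: try to make the Discriminator output 1 on fakes
        g_loss = loss_fn(D(fake), ones)
        opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    print(f"epoch {epoch}: D loss {d_loss.item():.3f}, G loss {g_loss.item():.3f}")

# G(torch.randn(16, latent_dim, device=device)).view(-1, 3, 32, 32) gives sample images to plot
```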
