Nithish N S - AIML Level 2 Tasks Report
2 / 2 / 2026
TASK 1: Decision Tree based ID3 Algorithm
In this exercise, I learned how to implement the ID3 algorithm from scratch to build a decision tree. Using the classic Play Tennis dataset, I calculated entropy to measure data impurity and information gain to find the best attributes for splitting the data at each node. This process of recursively selecting the most informative attribute, like 'Outlook' first, then 'Wind' or 'Humidity', allowed me to construct a predictive model.
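The entropy and information gain calculations described above can be sketched as follows. This is a minimal illustration, assuming the standard 14-row version of the Play Tennis dataset; the function names are my own and the notebook's actual code may differ.

```python
from collections import Counter
from math import log2

# Standard 14-row Play Tennis table (assumed): Outlook, Temperature, Humidity, Wind, Play
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the whole set minus the weighted entropy after splitting on one attribute."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)  # the attribute ID3 would place at the root
print(best_attribute, round(gains[best_attribute], 3))  # Outlook 0.247
```

ID3 repeats the same calculation on each subset of rows below the root, which is how 'Humidity' and 'Wind' end up as the next splits.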

TASK 2: Naive Bayesian Classifier
By training a Naive Bayes model on a small set of labeled emails, I was able to teach the machine to recognize the subtle difference between a friendly lunch invitation and a suspicious "win a prize" scam. It was satisfying to see my code take a brand-new sentence it had never encountered before and correctly flag it as spam, proving that even a simple mathematical approach can effectively interpret human language.
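A from-scratch sketch of this idea is shown below. The six training emails are hypothetical stand-ins (the actual training set is not reproduced in this report), and the sketch assumes a multinomial Naive Bayes model with Laplace (add-one) smoothing, which may differ from what the notebook used.

```python
from collections import Counter, defaultdict
from math import log

# Hypothetical toy emails, used only for illustration.
TRAIN = [
    ("win a free prize now", "spam"),
    ("claim your free prize today", "spam"),
    ("win cash now", "spam"),
    ("lunch at noon tomorrow", "ham"),
    ("are you free for lunch", "ham"),
    ("see you at the meeting tomorrow", "ham"),
]


def train(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(text, model):
    """Return the class with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen during training
            # Laplace smoothing keeps unseen (word, class) pairs from zeroing the score
            score += log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(predict("win a prize", model))
print(predict("free for lunch tomorrow", model))
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log-probabilities can simply be added.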

TASK 3: Ensemble techniques
In this task, I explored ensemble techniques, which combine multiple machine learning models to improve overall predictive performance. Using the Titanic dataset, I preprocessed the data and implemented Random Forest (Bagging) and Gradient Boosting algorithms. The code demonstrated that these ensemble methods significantly outperformed a single Decision Tree: Random Forest mainly by reducing variance, and Gradient Boosting mainly by reducing bias.

TASK 4: Random Forest, GBM and XGBoost
Random Forest
Random Forest is a Bagging (Bootstrap Aggregation) technique. It builds many decision trees (often hundreds) independently of one another. Each tree is trained on a bootstrap sample of the data (rows drawn at random with replacement) and considers a random subset of features at each split. Because the trees are independent, they can be trained in parallel. The final prediction is the average (for regression) or majority vote (for classification) of all the trees. Its main strength is reducing variance, making it stable and relatively resistant to overfitting without much tuning.
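The bootstrap-and-vote mechanism can be sketched in a few lines. This is a simplified illustration, not a full Random Forest: it uses one-split "stumps" on a hypothetical one-dimensional dataset, so the random feature subsetting is left out.

```python
import random
from collections import Counter

# Hypothetical 1-D data: label is 0 for small x and 1 for large x.
SAMPLES = [(x, 0) for x in range(5)] + [(x, 1) for x in range(5, 10)]


def fit_stump(samples):
    """Pick the threshold and leaf labels that make the fewest training errors."""
    best = None
    for threshold in sorted({x for x, _ in samples}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y for x, y in samples)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def fit_bagged(samples, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(samples) for _ in samples]
        trees.append(fit_stump(bootstrap))
    return trees


def predict_vote(trees, x):
    """Classification: return the majority vote of all trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


trees = fit_bagged(SAMPLES)
print(predict_vote(trees, 1), predict_vote(trees, 8))
```

Individual stumps place their thresholds differently depending on which rows they happened to draw, and the vote smooths those differences out, which is the variance reduction described above.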

GBM - Gradient Boosting Machines (The "Sequential" Approach)
GBM is a Boosting technique. Unlike Random Forest, it builds trees sequentially (one after another). Each new tree is fitted to the errors (residuals) left by the trees before it. It starts with a weak prediction and iteratively improves it, so later trees concentrate on whatever the current ensemble still gets wrong. Its main strength is reducing bias, often achieving higher accuracy than Random Forest, but it is more prone to overfitting and harder to tune.
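The sequential residual-fitting loop can be sketched as follows, assuming squared-error loss and regression stumps as the weak learners. The data points, learning rate, and function names are hypothetical choices for illustration.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on a 1-D feature."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)  # the initial weak prediction
    preds = [base] * len(ys)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, mse_history


# Hypothetical data with a step-like trend.
XS = [1, 2, 3, 4, 5, 6, 7, 8]
YS = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]
predict, mse_history = fit_gbm(XS, YS)
print(round(mse_history[0], 3), round(mse_history[-1], 3))
```

The training error keeps falling as rounds are added, which is also why boosting can overfit if the number of rounds and the learning rate are not tuned.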

XGBoost - eXtreme Gradient Boosting (The "Optimized" Approach)
XGBoost is not a fundamentally different technique from GBM; rather, it is a highly optimized implementation of Gradient Boosting. It takes the mathematical principles of GBM and adds system-level optimizations (parallel processing, cache awareness) and algorithmic enhancements (regularization to prevent overfitting, handling missing values). It is famous for being incredibly fast and winning many Kaggle competitions because it balances the high accuracy of boosting with better speed and robustness than standard GBM.
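One of the algorithmic enhancements, regularization, can be illustrated with a small calculation. This is a sketch under stated assumptions: for squared-error loss, the regularized leaf value in the XGBoost formulation reduces, as I understand it, to the sum of residuals divided by (number of samples + lambda). The function name and numbers are hypothetical, and this is not the library's code.

```python
def leaf_weight(residuals, reg_lambda):
    """Leaf value for squared-error loss with an L2 penalty on leaf weights.

    General form: w = -G / (H + lambda), where G and H are the sums of the
    first and second derivatives of the loss over the samples in the leaf.
    """
    gradient_sum = -sum(residuals)       # g_i = prediction - target = -residual
    hessian_sum = float(len(residuals))  # h_i = 1 for squared error
    return -gradient_sum / (hessian_sum + reg_lambda)


print(leaf_weight([2.0, 4.0], reg_lambda=0.0))  # 3.0, the plain mean of the residuals
print(leaf_weight([2.0, 4.0], reg_lambda=1.0))  # 2.0, shrunk toward zero
```

With lambda set to zero the leaf simply predicts the mean residual, as in standard GBM; a larger lambda shrinks each leaf's contribution, which makes individual trees less able to overfit.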

TASK 5: Hyperparameter Tuning
Hyperparameter tuning is essentially the "knob-turning" phase of machine learning. While the model learns its parameters from the data during training, hyperparameters are the settings you choose before training to control how that learning happens.
The simplest way to understand this is through a Grid Search, which systematically tries every combination of settings you provide to find the best performer.
Key Concepts in the Code
- param_grid: This is your menu. Grid search will multiply these out. Since we have 3 options for n_estimators and 3 for max_depth, it will train 9 different models.
- Cross-Validation (cv=5): To make sure the "best" settings aren't just a fluke, the code splits the training data into 5 parts, training and testing 5 times for every combination.
- The "Winner": grid_search.best_params_ tells you exactly which configuration yielded the highest accuracy.
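The bookkeeping behind these three points can be sketched without any library. The grid values below are hypothetical (the report only states that each list has 3 options), and mock_cv_score is a stand-in for training and scoring a real model on one fold, so the numbers it returns are not real results.

```python
from itertools import product

# Hypothetical grid: 3 options for each hyperparameter.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10],
}
cv = 5


def mock_cv_score(params, fold):
    """Stand-in for fitting on 4 folds and scoring on the 5th (not a real model)."""
    return (0.9
            - abs(params["n_estimators"] - 100) / 1000
            - abs(params["max_depth"] - 5) / 100
            - 0.001 * fold)


# "Multiply out" the menu into every combination of settings.
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]

results = []
for params in combinations:
    fold_scores = [mock_cv_score(params, fold) for fold in range(cv)]
    results.append((sum(fold_scores) / cv, params))  # mean score across the folds

best_score, best_params = max(results, key=lambda item: item[0])
total_fits = len(combinations) * cv

print(len(combinations), total_fits)  # 9 45
print(best_params)
```

In scikit-learn, GridSearchCV runs this loop internally with a real estimator and exposes the winning configuration as best_params_. The count of 45 fits shows why grid search becomes expensive as the grid grows.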
TASK 6: Image Classification using KMeans Clustering
This project uses K-Means clustering to group handwritten digits from the MNIST dataset.
First, the images are loaded with PyTorch and converted into NumPy arrays for processing. After visualizing sample digits to understand the data, the K-Means algorithm groups the images based on their pixel patterns. Since this is unsupervised learning, the model doesn't use labels; it simply puts similar-looking digits into the same clusters. Finally, by comparing these clusters to the actual labels, we can see how well the algorithm recognized the digits on its own.
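The clustering and the final comparison against labels can be sketched on a toy example. This is a simplified illustration: the six 4-pixel "images" are hypothetical stand-ins for flattened MNIST digits, the initial centroids are fixed by hand rather than chosen randomly, and the helper names are my own.

```python
from collections import Counter

# Hypothetical flattened "images" (4 pixels each) with their true labels.
POINTS = [
    (0.9, 0.1, 0.9, 0.1), (1.0, 0.0, 0.8, 0.2), (0.8, 0.2, 1.0, 0.0),
    (0.1, 0.9, 0.1, 0.9), (0.0, 1.0, 0.2, 0.8), (0.2, 0.8, 0.0, 1.0),
]
LABELS = [0, 0, 0, 1, 1, 1]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]


def kmeans(points, centroids, n_iters=10):
    """Alternate between assigning points and moving centroids to cluster means."""
    for _ in range(n_iters):
        assignments = assign(points, centroids)
        updated = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[k])  # keep an empty cluster's centroid in place
        centroids = updated
    return assign(points, centroids), centroids


def cluster_to_label(assignments, labels):
    """Map each cluster to the most common true label among its members."""
    mapping = {}
    for k in set(assignments):
        members = [lab for a, lab in zip(assignments, labels) if a == k]
        mapping[k] = Counter(members).most_common(1)[0][0]
    return mapping


# Labels are not used during clustering, only afterwards for evaluation.
assignments, centroids = kmeans(POINTS, [POINTS[0], POINTS[3]])
mapping = cluster_to_label(assignments, LABELS)
accuracy = sum(mapping[a] == lab for a, lab in zip(assignments, LABELS)) / len(LABELS)
print(assignments, accuracy)
```

On real MNIST data the clusters are far less clean, because different digits can share similar pixel patterns and the same digit can be written in several styles.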
TASK 7: Anomaly Detection
Anomaly detection (or outlier detection) is the identification of rare items, events, or observations which raise suspicions by differing significantly from the majority of the data.
The example for this task uses the Isolation Forest algorithm from the scikit-learn library. This is one of the most popular and effective methods for general anomaly detection because it isolates anomalies directly rather than profiling normal data points.
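To show what "isolating" means, here is a from-scratch sketch of the idea rather than scikit-learn's IsolationForest class. It is simplified (every tree sees all points, with no subsampling), the data is hypothetical, and the scoring follows the usual path-length normalization as I understand it.

```python
import random
from math import log

EULER_GAMMA = 0.5772156649

# Hypothetical data: a tight cluster near (1, 1) plus one obvious outlier at the end.
POINTS = [
    (1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.2, 1.2), (0.8, 0.8),
    (1.0, 0.9), (0.9, 1.2), (1.1, 1.1), (1.05, 0.95), (0.95, 1.05),
    (8.0, 8.0),
]


def c_factor(n):
    """Expected path length for n points, used to normalize scores."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, rng, depth=0, max_depth=8):
    """Split on a random feature at a random value until points are isolated."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    values = [p[feature] for p in points]
    low, high = min(values), max(values)
    if low == high:
        return {"size": len(points)}
    split = rng.uniform(low, high)
    return {
        "feature": feature,
        "split": split,
        "left": build_tree([p for p in points if p[feature] < split], rng, depth + 1, max_depth),
        "right": build_tree([p for p in points if p[feature] >= split], rng, depth + 1, max_depth),
    }


def path_length(tree, point, depth=0):
    """Number of splits needed to reach the point's leaf, plus a correction for unsplit leaves."""
    if "split" not in tree:
        return depth + c_factor(tree["size"])
    branch = "left" if point[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[branch], point, depth + 1)


def anomaly_scores(points, n_trees=100, seed=0):
    """Scores lie in (0, 1]; shorter average paths give scores closer to 1."""
    rng = random.Random(seed)
    trees = [build_tree(points, rng) for _ in range(n_trees)]
    norm = c_factor(len(points))
    scores = []
    for p in points:
        avg_path = sum(path_length(t, p) for t in trees) / n_trees
        scores.append(2 ** (-avg_path / norm))
    return scores


scores = anomaly_scores(POINTS)
most_anomalous = max(range(len(POINTS)), key=lambda i: scores[i])
print(POINTS[most_anomalous])
```

An outlier sits far from the rest of the data, so a random split usually separates it after only one or two cuts, while points inside the cluster need many more. That short path is what marks it as an anomaly.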
TASK 8: Generative AI Task Using GAN
A Generative Adversarial Network (GAN) is a class of machine learning framework in which two neural networks, the Generator and the Discriminator, compete against each other in a zero-sum game.
The Dynamic Duo
- The Generator: Its goal is to create realistic data (like images or music) from random noise. Think of it as a "counterfeiter" trying to create a masterpiece.
- The Discriminator: Its job is to distinguish between real data from a training set and the "fake" data created by the Generator. Think of it as an "art critic" or detective.
How it Works
As training progresses, the Generator gets better at producing realistic outputs to fool the Discriminator, while the Discriminator gets better at catching the fakes. This "adversarial" process continues until the Generator produces data so authentic that the Discriminator can no longer tell the difference.
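The two competing objectives can be written down as simple loss functions. This sketch assumes the common binary cross-entropy setup with the non-saturating Generator loss, which may differ from what the notebook uses. Here d_real and d_fake stand for the Discriminator's output probabilities on a real sample and a generated sample; no networks are trained.

```python
from math import log


def discriminator_loss(d_real, d_fake):
    """The Discriminator wants d_real close to 1 and d_fake close to 0."""
    return -(log(d_real) + log(1.0 - d_fake))


def generator_loss(d_fake):
    """The Generator wants the Discriminator to output values near 1 on its fakes."""
    return -log(d_fake)


# Early in training: the Discriminator easily spots fakes.
print(discriminator_loss(d_real=0.9, d_fake=0.1), generator_loss(d_fake=0.1))

# At the end point described above: the Discriminator can only guess (0.5 everywhere).
print(discriminator_loss(d_real=0.5, d_fake=0.5))  # equals 2 * log(2)
```

When the Discriminator outputs 0.5 for every input, it is no better than a coin flip, which corresponds to the point where it "can no longer tell the difference".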
Notebook -> Generation of images using the MNIST dataset.




