31 / 1 / 2026
Task 1 - Decision Tree based ID3 Algorithm
Decision trees - They are supervised learning algorithms used for regression or classification tasks.
Classification - Simply put, a decision tree is a bunch of true/false (if-else) conditions that classify the data. With the help of these conditions we keep splitting the nodes: first from the root node into impure nodes, and further until we end up with pure leaf nodes. It must be noted that not every way of splitting gets us pure leaf nodes at the end. So how does the model learn which split is the best?
There are several criteria for this, including Gini impurity, information gain, etc.
Here, I've used this algorithm to classify the classic example: the Iris dataset.
Since information gain gave the best result while using the inbuilt function, I coded the same criterion from scratch.
First, you define a function to find entropy.
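A minimal sketch of such an entropy function, assuming NumPy and integer-encoded class labels (the names here are my own, not necessarily those in the original code):

```python
import numpy as np

def entropy(y):
    # Shannon entropy of an integer label array
    counts = np.bincount(y)
    probs = counts[counts > 0] / len(y)
    return -np.sum(probs * np.log2(probs))
```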

--> Now we have to decide the best split, based on the highest information gain.
For this, an information gain function is defined.
In this, first the values of the feature are sorted.
The base entropy of the labels is calculated.
Then we loop through all possible thresholds.
Based on each threshold, you split the data into two subsets.
You proceed further in the loop only if neither split is empty.
Compute the entropy of the left and right subsets, weight them by their sizes, and compare the result with the base entropy to get the gain.
Among all thresholds, keep the one that gives the highest gain.
This is how you find the best gain and the best threshold.
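A sketch of that loop, reusing the entropy function (and NumPy import) from above; the exact threshold choice and names are my own, not necessarily the original ones:

```python
def best_split_for_feature(feature_values, y):
    # Returns (best_gain, best_threshold) for one feature column
    base_entropy = entropy(y)
    best_gain, best_threshold = 0.0, None
    for threshold in np.unique(feature_values):   # unique values come out sorted
        left = feature_values <= threshold
        right = ~left
        if left.sum() == 0 or right.sum() == 0:   # skip empty splits
            continue
        n = len(y)
        weighted = (left.sum() / n) * entropy(y[left]) + (right.sum() / n) * entropy(y[right])
        gain = base_entropy - weighted
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_gain, best_threshold
```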
--> Next, we define the best feature function. Here, the feature whose best split gives the highest gain becomes the best feature.
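A matching sketch of that search over columns:

```python
def best_feature(X, y):
    # Pick the column whose best split yields the highest information gain
    best_gain, best_col, best_thr = 0.0, None, None
    for col in range(X.shape[1]):
        gain, thr = best_split_for_feature(X[:, col], y)
        if gain > best_gain:
            best_gain, best_col, best_thr = gain, col, thr
    return best_col, best_thr
```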
--> The best feature function is called, and the best feature and threshold are selected. Now we start creating nodes, recursively calling the same function until a pure node is reached.
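One way this recursion could look, using a plain dictionary as the node structure (my own representation, reusing NumPy and best_feature from above; the extra stop when no split gives any gain is added so the recursion always terminates):

```python
def id3(X, y):
    # Pure node: all samples share one class -> leaf
    if len(np.unique(y)) == 1:
        return {"leaf": True, "label": int(y[0])}
    col, thr = best_feature(X, y)
    if col is None:                    # no split improves the gain -> majority leaf
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    left = X[:, col] <= thr
    return {
        "leaf": False, "feature": col, "threshold": thr,
        "left": id3(X[left], y[left]),
        "right": id3(X[~left], y[~left]),
    }
```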
--> Use the id3 function on the training dataset and calculate the accuracy.
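A usage sketch on the Iris data; the predict_one helper is my own, and I'm assuming a held-out test split for the accuracy:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def predict_one(node, x):
    # Walk down the tree until a leaf is reached
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
tree = id3(X_train, y_train)
preds = np.array([predict_one(tree, x) for x in X_test])
print("Test accuracy:", np.mean(preds == y_test))
```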
Task 2 - Naive Bayesian Classifier
It is a classification algorithm based on Bayes' theorem, and it's called naive because it assumes all the inputs are independent of each other, which is usually not true. First, calculate the probability of each word occurring in the normal messages and in the spam messages separately; we'll call these probabilities the likelihoods. Next, calculate the overall probability that any message is spam or normal; we'll call this the prior probability. Now, for any given phrase, multiply the prior probability by the likelihood of each word in the phrase, for spam and for normal messages separately. This gives a score for the phrase with respect to each class. The phrase is considered normal if its score with respect to the normal messages is higher than its score with respect to the spam messages.
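A from-scratch sketch of this scoring scheme; the add-one (Laplace) smoothing is my own addition to avoid zero likelihoods for unseen words:

```python
from collections import Counter

def train_naive_bayes(messages, labels):
    # messages: list of word lists, labels: "spam" or "normal"
    word_counts = {"spam": Counter(), "normal": Counter()}
    for words, label in zip(messages, labels):
        word_counts[label].update(words)
    vocab = set(word_counts["spam"]) | set(word_counts["normal"])
    # prior probability of each class
    priors = {c: labels.count(c) / len(labels) for c in ("spam", "normal")}
    # likelihood P(word | class), with add-one smoothing
    likelihoods = {
        c: {w: (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab))
            for w in vocab}
        for c in ("spam", "normal")
    }
    return priors, likelihoods

def classify(phrase, priors, likelihoods):
    # Multiply the prior by the likelihood of every known word, per class
    scores = {}
    for c in ("spam", "normal"):
        score = priors[c]
        for w in phrase:
            if w in likelihoods[c]:
                score *= likelihoods[c][w]
        scores[c] = score
    return "normal" if scores["normal"] >= scores["spam"] else "spam"
```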
Task 3 - Ensemble techniques
Ensembling is a technique where multiple models are combined to make a more accurate and reliable prediction than a single model.
So instead of just asking one model for its opinion and basing our prediction only on its answer, we ask the opinions of multiple models to increase our accuracy.
There are mainly 3 ensembling techniques (a short scikit-learn sketch of each follows the list):
- BAGGING (The parallel approach) - In this method a bunch of similar models are created, and each model is given a slightly different version of the same dataset. They all work independently and vote for the final answer, thus classifying the data.
- BOOSTING (The sequential approach) - We train one model first to get an initial result. The second model is trained specifically to fix the mistakes of the first, the third tries to correct the mistakes of the first two, and so on. By the end we have a much more accurate combined output.
- STACKING (The hierarchical approach) - First we use completely different models on our training data to get different predictions. Then a manager (meta) model learns which base model works better on different instances of the training data, assigns weights to each of their predictions accordingly, and a final score is calculated. Predictions are made based on this score.
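A brief scikit-learn sketch of the three approaches (assuming a recent scikit-learn version; the model choices are only illustrative):

```python
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Bagging: many similar trees, each on its own bootstrap sample, voting in parallel
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50)

# Boosting: models trained one after another, each focusing on the previous ones' mistakes
boosting = AdaBoostClassifier(n_estimators=50)

# Stacking: different base models plus a "manager" (meta) model combining their predictions
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression(),
)
```

Each of these is fitted and scored like any other estimator, e.g. bagging.fit(X_train, y_train).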
Task 4- Random Forest, GBM and Xgboost
GBM
Gradient boosting machine is an ensembling technique that builds a predictive model by combining multiple weak models sequentially. Here I have used the Titanic dataset for implementation. What does it do? (A minimal sketch follows the steps below.)
- It starts simple, by taking the average survival rate as the initial guess.
- It calculates the residuals, i.e. how far each person's actual outcome is from that average.
- Now it builds a small decision tree to predict these errors.
- This new decision tree is added to the initial guess: New Guess = Initial Guess + (0.1 × Correction Tree Prediction).
- This process is repeated until the errors become very small.
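A minimal sketch of such a model on the Titanic data with scikit-learn; the file name and column choices are assumptions, not necessarily what was used (GradientBoostingClassifier actually starts from the log-odds rather than the raw average, but the idea of sequential correction trees scaled by the learning rate is the same):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic.csv")                 # hypothetical file name
features = ["Pclass", "Age", "Fare"]            # an assumed subset of columns
X = df[features].fillna(df[features].median())
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# learning_rate=0.1 is the 0.1 scaling factor in the update rule above
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbm.fit(X_train, y_train)
print("GBM accuracy:", gbm.score(X_test, y_test))
```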
XGBoost
This also works similarly to GBM, but it is much better because of several "extreme" technical enhancements (see the sketch after the list):
- Regularization - It has built-in L1 and L2 regularization, which helps prevent overfitting.
- Handling sparse data - If a piece of information is missing, XGBoost learns which way to branch the tree based on the training data.
- Parallel processing - Unlike traditional GBM, XGBoost builds trees using parallel processing.
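A sketch with the xgboost package, reusing the Titanic split from the GBM example above (the hyperparameter values are illustrative):

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    reg_alpha=0.1,    # built-in L1 regularization
    reg_lambda=1.0,   # built-in L2 regularization
    n_jobs=-1,        # parallel tree construction
)
xgb.fit(X_train, y_train)   # missing values in X need no special handling
print("XGBoost accuracy:", xgb.score(X_test, y_test))
```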
Random forest classifier
This comes under the bagging family of ensembling techniques. A random forest classifier makes use of two ideas:
1. Bagging (Bootstrap Aggregating) - Several decision trees are built, and instead of giving every tree the exact same data, the algorithm gives each tree a random sample of the dataset.
2. Random feature selection - Each tree is only allowed to look at a random subset of the features, which helps prevent overfitting.
Once training is completed and new data is given for classification, every tree in the forest makes its own individual prediction and casts its "vote". The final prediction is made according to these votes.
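A minimal sketch, again reusing the Titanic split from above; max_features controls the random feature subset each split may look at:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,        # number of bootstrap-sampled trees
    max_features="sqrt",     # random subset of features per split
    random_state=42,
)
rf.fit(X_train, y_train)
print("Random forest accuracy:", rf.score(X_test, y_test))
```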
Task-5 Hyperparameter Tuning
It is the process of finding the best settings for a model's hyperparameters so as to get the best possible performance. Hyperparameters are the parameters set by the user and not learnt during training.
Some examples include the max depth of a tree in decision trees, the number of layers in a neural network, the learning rate, the number of iterations, etc.
Our main goal in hyperparameter tuning is to find the best combination of all these settings. Since manually testing every possibility is impractical, we use different strategies to search efficiently (a short sketch of the first two follows the list).
- Grid search - You define a discrete set of values for each hyperparameter, and the algorithm tries every possible combination. Although it's computationally expensive, it's thorough, and if the best solution is in your grid, you'll find it.
- Random search - Instead of checking every single point, random search picks configurations at random from a distribution.
- Bayesian optimization - Here, a probabilistic model is built to predict which hyperparameters might yield the best results.
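A sketch of the first two strategies with scikit-learn, assuming some existing X_train/y_train arrays (Bayesian optimization needs a separate library such as Optuna or scikit-optimize, so it's omitted here):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Grid search: tries every combination of the listed values
param_grid = {"max_depth": [3, 5, 10], "n_estimators": [50, 100, 200]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Grid search best params:", grid.best_params_)

# Random search: samples a fixed number of configurations from distributions
param_dist = {"max_depth": randint(2, 12), "n_estimators": randint(50, 300)}
rand = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_dist,
                          n_iter=10, cv=5, random_state=42)
rand.fit(X_train, y_train)
print("Random search best params:", rand.best_params_)
```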
Task 6 - Image Classification using KMeans Clustering
