
BLOG · 10/9/2025

Decision trees- A decision tree is a supervised learning algorithm used for regression or classification tasks.
Classification- Simply put, a decision tree is a bunch of true/false (if-else) conditions used to classify the data. With the help of these conditions we keep splitting the nodes: first from the root node into impure nodes, and further until we end up with pure leaf nodes. It must be noted that not every way of splitting gets us pure leaf nodes at the end. So how does the model learn which split is the best?
There are many methods to decide this, including Gini impurity, information gain, etc.
Here, I've used this algorithm to classify the classic example, the Iris dataset.
Since information gain gave the best result when using the inbuilt function, I coded the same criterion from scratch.
First, you define a function to compute entropy.
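For example, a minimal entropy helper could look like this (a sketch assuming NumPy and an array of integer class labels; the function name is my own):

```python
import numpy as np

def entropy(y):
    """Shannon entropy of an array of class labels."""
    # Probability of each class present in this node
    _, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    # H = -sum(p * log2(p)); every p here is > 0, so log2 is safe
    return -np.sum(probs * np.log2(probs))
```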

--> Now we have to decide the best split, based on the best information gain.
For this, an information gain function is defined.
In this, first the values of the feature are sorted.
The base entropy is calculated.
Now, we loop through all possible thresholds.
Based on each threshold, you split the data into a left and a right subset.
You proceed further in the loop only if neither of your splits is empty.
Compute the entropy of the left and right subsets, weight them by their sizes, and compare the result with your base entropy to get the gain.
Among all the thresholds, keep the one with the best gain.
This is how you find the best gain and the best threshold for a feature.
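A sketch of that threshold loop for a single feature column, reusing the `entropy` helper above (the names and details are mine, not necessarily the notebook's):

```python
import numpy as np

def best_threshold_for_feature(X_col, y):
    """Return the (gain, threshold) pair that maximizes information gain on one feature."""
    base_entropy = entropy(y)
    thresholds = np.unique(X_col)          # unique feature values, already sorted
    best_gain, best_thr = 0.0, None

    for thr in thresholds:
        left = X_col <= thr
        right = ~left
        # Proceed only if neither split is empty
        if left.sum() == 0 or right.sum() == 0:
            continue
        # Weighted entropy of the two child nodes
        n = len(y)
        child_entropy = (left.sum() / n) * entropy(y[left]) + (right.sum() / n) * entropy(y[right])
        gain = base_entropy - child_entropy
        if gain > best_gain:
            best_gain, best_thr = gain, thr

    return best_gain, best_thr
```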
--> Next we define the best feature function. Here, the feature whose best split gives the highest gain becomes the best feature.
--> The best feature function is called, and the best feature and threshold are selected. Now start to create nodes, and recursively call this same function until a pure node is reached.
--> Use the id3 function on the training dataset and calculate the accuracy.
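Putting the pieces together, a condensed sketch of the best-feature selection and the recursive build (it reuses `entropy` and `best_threshold_for_feature` from above; the dictionary-based node structure is my choice, not necessarily the notebook's):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def best_feature(X, y):
    """Pick the feature (and its threshold) whose best split gives the highest gain."""
    best_f, best_thr, best_gain = None, None, 0.0
    for f in range(X.shape[1]):
        gain, thr = best_threshold_for_feature(X[:, f], y)
        if thr is not None and gain > best_gain:
            best_f, best_thr, best_gain = f, thr, gain
    return best_f, best_thr, best_gain

def id3(X, y):
    """Recursively split the data until the node is pure (or no split helps)."""
    classes = np.unique(y)
    if len(classes) == 1:                 # pure leaf node
        return classes[0]
    f, thr, gain = best_feature(X, y)
    if f is None or gain == 0.0:          # no useful split left: majority class
        return np.bincount(y).argmax()
    left = X[:, f] <= thr
    return {"feature": f, "threshold": thr,
            "left": id3(X[left], y[left]),
            "right": id3(X[~left], y[~left])}

def predict_one(tree, x):
    while isinstance(tree, dict):
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = id3(X_train, y_train)
preds = np.array([predict_one(tree, x) for x in X_test])
print("Accuracy:", np.mean(preds == y_test))
```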
Naïve Bayes
It is a classification algorithm based on Bayes' theorem, and it's called naïve because it assumes all the inputs are independent of each other, which is usually not true. First, calculate the probability of each word occurring in normal messages and in spam messages separately; we'll call these probabilities the likelihoods. Then calculate the overall probability of a message being spam or normal (the fraction of each in the training data); we'll call this the prior probability. Now, for any given phrase, you multiply the prior probability by the likelihood of each word in the phrase, for spam and for normal messages separately. This gives a score for the phrase with respect to spam and with respect to normal. The phrase is considered normal if its score with respect to normal messages is higher than its score with respect to spam messages.
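A toy sketch of that counting logic for a spam filter (the helper names are mine, and Laplace smoothing via `alpha` is added so a word unseen in one class doesn't zero out the whole score):

```python
from collections import Counter

def train_naive_bayes(messages, labels):
    """messages: list of token lists; labels: 'spam' or 'normal' for each message."""
    word_counts = {"spam": Counter(), "normal": Counter()}
    class_counts = Counter(labels)
    for tokens, label in zip(messages, labels):
        word_counts[label].update(tokens)
    # Prior: fraction of training messages in each class
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    return word_counts, priors

def score(tokens, word_counts, priors, label, alpha=1.0):
    """Prior multiplied by the (smoothed) likelihood of each word, for one class."""
    total = sum(word_counts[label].values())
    vocab = len(set(word_counts["spam"]) | set(word_counts["normal"]))
    s = priors[label]
    for w in tokens:
        s *= (word_counts[label][w] + alpha) / (total + alpha * vocab)
    return s

def classify(tokens, word_counts, priors):
    spam_score = score(tokens, word_counts, priors, "spam")
    normal_score = score(tokens, word_counts, priors, "normal")
    return "normal" if normal_score > spam_score else "spam"
```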
Ensembling is a technique where multiple models are combined to make a more accurate and reliable prediction than a single model.
So instead of just asking one model for its opinion and basing our prediction only on its answer, we ask multiple models for their opinions to increase our accuracy.
There are mainly three ensembling techniques:
BAGGING (The parallel approach)- In this method a bunch of similar models are created, and each model is given a slightly different version of the same dataset. They all work independently and vote for the final answer, and the data is classified according to the vote.
BOOSTING (The sequential approach)- We use one model first to get an initial result. The second model is trained specifically to fix the mistakes of the first, the third tries to correct the mistakes of the first two, and so on. By the end we have a much stronger combined model.
STACKING (The hierarchical approach)- Here we first use completely different models on our training data to get different results. Then a manager (meta) model is used to learn which base model works better on different instances of the training data, and it assigns weights to each of their predictions accordingly; a combined score is calculated, and predictions are made based on that score.
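To make the three approaches concrete, scikit-learn ships an off-the-shelf estimator for each; a minimal sketch (the base models and settings are arbitrary choices of mine):

```python
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              StackingClassifier, RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Bagging: many similar trees trained in parallel on bootstrap samples, then voting
# ("estimator" is the keyword in scikit-learn >= 1.2; older versions call it "base_estimator")
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50)

# Boosting: each new weak learner focuses on the mistakes of the previous ones
boosting = AdaBoostClassifier(n_estimators=50)

# Stacking: diverse base models plus a "manager" (meta) model on top of their predictions
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("forest", RandomForestClassifier(n_estimators=100))],
    final_estimator=LogisticRegression(),
)

# Each of these behaves like any other estimator: fit(X, y), then predict(X_new)
```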
GBM
Gradient boosting machine (GBM) is an ensembling technique that builds a predictive model by combining multiple weak models sequentially, each new one correcting the errors of the ones before it. Here I have used the Titanic dataset for the implementation.
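A hedged sketch of how this might look with scikit-learn's GradientBoostingClassifier (the file name and columns below assume the standard Kaggle Titanic train.csv; the actual notebook may preprocess differently):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assumed file and columns from the standard Kaggle Titanic dataset
df = pd.read_csv("train.csv")
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Age"] = df["Age"].fillna(df["Age"].median())
X = df[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each new shallow tree is fitted to the errors of the ensemble built so far
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gbm.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, gbm.predict(X_test)))
```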
XGBoost
This also works similarly to GBM, but it performs much better because of several "extreme" technical enhancements.
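For comparison, a sketch of the same task with the xgboost library, reusing the Titanic train/test split from the GBM sketch above (the hyperparameter values are arbitrary):

```python
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# A few of the "extreme" extras on top of plain gradient boosting:
# built-in L2 regularization (reg_lambda) and row/column subsampling
xgb = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=3,
                    subsample=0.8, colsample_bytree=0.8, reg_lambda=1.0)
xgb.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, xgb.predict(X_test)))
```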
Random forest classifier
This comes under the bagging family of ensembling techniques. A random forest classifier makes use of two techniques. 1. Bagging (bootstrap aggregating) – different decision trees are built, and instead of giving every tree the exact same data, the algorithm gives each tree a random sample of the dataset. 2. Random feature selection – each tree is only allowed to look at a random subset of the features, which helps prevent overfitting. Once training is completed and new data is given for classification, every tree in the forest makes its own individual prediction and gives its "vote"; according to these votes, the final prediction is made.
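A minimal scikit-learn sketch (the post doesn't name a dataset here, so I'm using the Iris data from earlier as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# n_estimators trees, each fitted on a bootstrap sample of the rows (bagging);
# max_features="sqrt" lets each split look at only a random subset of features
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)

# Every tree votes; the forest returns the majority class
print("Accuracy:", rf.score(X_test, y_test))
```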
Hyperparameter tuning
It is the process of finding the best settings for the various parameters of a model so as to get the best results. Hyperparameters are those parameters set by the user and not learnt during training.
Some examples include the maximum depth of a tree in decision trees, the number of layers in a neural network, the learning rate in regression, the number of iterations, etc.
Our main goal in hyperparameter tuning is to look for the best combination of settings of all these parameters. Since manually testing every possibility is impractical, we use different strategies to find the winners efficiently.
Grid search- You define a discrete set of values for each hyperparameter, and the algorithm tries every possible combination. Although it's computationally expensive, it's thorough, and if the best solution is in your grid, you'll find it.
Random search- Instead of checking every single point, Random Search picks configurations at random from a distribution.
Bayesian optimization- Here, a probabilistic model is built to predict which hyperparameter settings might yield the best results, so each new trial benefits from what was learnt in the previous ones.
The implementation
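For illustration (a hedged sketch, not the implementation the post refers to), this is how grid search and random search look in scikit-learn when tuning a random forest on the Iris data; the parameter grid is just an example:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
    "max_features": ["sqrt", "log2"],
}

# Grid search: tries every combination (3 * 3 * 2 = 18 candidates, each cross-validated)
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# Random search: samples a fixed number of combinations from the same space
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=10, cv=5, random_state=0)
rand.fit(X, y)
print(rand.best_params_, rand.best_score_)
```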
Image clustering
In image clustering, first you decide how many groups you want to classify the images into; that is the k. Randomly choose k points as centroids and assign the rest of the points to the k groups around those centroids. Then look at each group, find the exact center of all the points in it, and move the centroid there. Due to this change some points move to different clusters, so the assignment is redone. This procedure is repeated until no point moves to a different cluster when the centroid positions change. I applied this logic on the MNIST dataset, to classify the 10 digits into 10 clusters, but the accuracy was only 60%. When I increased the number of clusters to 256, I got an accuracy of 90%.
https://colab.research.google.com/drive/1GSwx-n2bpt4cICcTdV5uzpYjrsCgPFkv?usp=sharing
https://colab.research.google.com/drive/1BqZZFwQU5UwXyPLBCWaiHruLEKipTM3k?usp=sharing
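A hedged sketch of that experiment with scikit-learn's KMeans, using the small load_digits set as a quick stand-in for MNIST and mapping each cluster to its most common digit by majority vote:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)    # 8x8 digit images as a stand-in for MNIST

def cluster_accuracy(n_clusters):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    # Label every point with the most common digit inside its cluster
    pred = np.zeros_like(y)
    for c in range(n_clusters):
        mask = km.labels_ == c
        if mask.any():
            pred[mask] = np.bincount(y[mask]).argmax()
    return accuracy_score(y, pred)

print("k = 10 :", cluster_accuracy(10))    # one cluster per digit
print("k = 256:", cluster_accuracy(256))   # several clusters per digit, usually much higher accuracy
```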
Anomaly detection (also known as outlier detection) is the process of identifying data points, events, or observations that deviate significantly from the normal behavior of a dataset. Types:
I have used the LOF, Isolation Forest and one-class SVM algorithms on the data.
LOF- (Local outlier factor)- For every point, the algorithm finds the k nearest neighbors. In most datasets, the density of points is relatively uniform, with only a few points having significantly lower or higher densities than the rest. The LOF algorithm uses this property to identify points that have a significantly lower density than their neighbors, which are likely to be anomalies.
Isolation forest- In this algorithm decision trees with random splits are used to isolate the outliers. The majority of the points lie in dense regions and need many splits before they are isolated on their own, whereas anomalies lie far from the rest of the data and can be isolated with only a few splits, so points with short isolation paths are flagged as outliers.
One class SVM- This algorithm uses support vector machines to learn a decision boundary that separates the majority of the data from the anomalies.
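A compact sketch applying all three detectors to the same toy data (scikit-learn's convention: +1 means inlier, -1 means anomaly):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# A dense "normal" cluster plus a few scattered far-away points
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.uniform(-8, 8, size=(10, 2))])

lof = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
iso = IsolationForest(random_state=0).fit_predict(X)
svm = OneClassSVM(nu=0.05, gamma="scale").fit_predict(X)

for name, labels in [("LOF", lof), ("Isolation Forest", iso), ("One-class SVM", svm)]:
    print(name, "flagged", int((labels == -1).sum()), "points as anomalies")
```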
GAN
A Generative Adversarial Network (GAN) is a type of machine learning framework where two neural networks "compete" against each other to create highly realistic synthetic data. It has two parts: a generator, which creates synthetic samples from random noise, and a discriminator, which tries to tell those fakes apart from real data.
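A bare-bones sketch of those two parts in PyTorch (toy fully connected networks on 2-D data; real image GANs use convolutional networks and far more careful training):

```python
import torch
import torch.nn as nn

data_dim, noise_dim = 2, 16

# Generator: turns random noise into a fake sample
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
# Discriminator: outputs the probability that a sample is real
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)

def train_step(real_batch):
    n = len(real_batch)
    # 1) Train D: real samples should score 1, generated ones 0
    fake = G(torch.randn(n, noise_dim)).detach()
    d_loss = loss_fn(D(real_batch), torch.ones(n, 1)) + loss_fn(D(fake), torch.zeros(n, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # 2) Train G: try to make D label its fakes as real
    fake = G(torch.randn(n, noise_dim))
    g_loss = loss_fn(D(fake), torch.ones(n, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()

# train_step(torch.randn(32, data_dim))  # one step on a toy batch of "real" samples
```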
LangChain
If an LLM on its own is like a brilliant brain in a jar, LangChain provides the nervous system and hands. It allows the model to connect to your data, remember past conversations, and execute complex sequences of tasks. LangChain is organized into several key modules that handle different parts of the development:
Model I/O: This handles the "talk" between you and the AI.
RAG (Retrieval-Augmented Generation)- It allows the AI to "read" your specific documents (PDFs, databases, or websites) to answer questions without needing to be retrained.
Chains- It allows you to link different components together.
Memory-Memory components allow the AI to store and recall previous parts of the conversation.
Agents- You give the AI a goal and a set of tools, and it decides which tools to use and in what order to solve the problem.
So, while building the PDF query system, we are essentially implementing RAG. Steps: load the PDF and split it into chunks, embed the chunks and store them in a vector store, retrieve the chunks most relevant to the query, and pass them to the LLM to generate the answer.
https://colab.research.google.com/drive/1w_hIKtWpXy5IG4jhBnLvNh6UKMjLpXXn?usp=sharing
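A rough sketch of those steps with LangChain. Class names and module paths shift between LangChain versions, and the file name, embeddings, and chat model below are placeholders, so treat this as an outline rather than the notebook's exact code:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

# 1) Load the PDF and split it into overlapping chunks
docs = PyPDFLoader("my_document.pdf").load()    # placeholder file name
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# 2) Embed the chunks and store them in a vector store
store = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 3) Wire the retriever and the LLM into a question-answering chain
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(), retriever=store.as_retriever())

# 4) Ask a question about the document
print(qa.invoke({"query": "What is this document about?"}))
```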
OCR (Optical Character Recognition) is a technology that turns images of text into actual, editable digital text.
How does it do it?
It first edits the given image so that the text pops out clearly and detects the text by drawing boxes or lines around it; this is called pre-processing. Next it compares the characters with a stored database of fonts and patterns; this process is called character recognition.
Finally, it does post-processing, where spelling corrections, checking of language rules, and layout understanding are done.
Due to difficulties in implementing PaddleOCR, I have implemented EasyOCR.
https://colab.research.google.com/drive/1axiJMvWgz1q3jPbL2LPZAD3vpo0c1NU8?usp=sharing
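A minimal EasyOCR usage sketch (the image path is a placeholder):

```python
import easyocr

# Loads the detection and recognition models for English (downloaded on first run)
reader = easyocr.Reader(["en"])

# readtext returns a list of (bounding_box, text, confidence) tuples
for bbox, text, confidence in reader.readtext("sample_image.png"):
    print(f"{text!r} (confidence: {confidence:.2f})")
```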