5 / 10 / 2024
AIML LEVEL 2 Report
By Dhruv Mahajan
Task 1: Decision Tree
Using the resources and the holy bible of machine learning that is The Elements of Statistical Learning, I understood the maths behind decision trees and implemented ID3.
I also learnt about the CART algorithm, which is now more popularly used. I really enjoyed this task, as it later helped me in my AIML 4th-sem AI class!!
ID3 can be explained to a person with ease: it just makes sure there is less ambiguity in the later nodes as we ask questions. It's like a detective asking more questions to narrow things down!
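To make that concrete, here is a minimal sketch (not the notebook implementation) of the quantity ID3 maximises at every node: the information gain, i.e. how much a question reduces the entropy of the labels.

```python
# Minimal sketch of the idea behind ID3: pick the attribute whose split gives
# the largest information gain, i.e. reduces ambiguity (entropy) the most.
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Parent entropy minus the weighted entropy of the children produced
    by splitting on one categorical attribute."""
    parent = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute_index], []).append(label)
    weighted_children = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return parent - weighted_children

# Toy example: attribute 1 separates the classes perfectly, attribute 0 not at all.
rows = [("yes", "no"), ("yes", "yes"), ("no", "no"), ("no", "yes")]
labels = ["play", "stay", "play", "stay"]
print(information_gain(rows, labels, 0))  # 0.0 -> useless question
print(information_gain(rows, labels, 1))  # 1.0 -> perfect question
```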
Task 2: Naive Bayesian Classifier
First, I watched some beautiful videos on probability by 3Blue1Brown:
These helped rebuild the basic stats and maths concepts I had forgotten, and they also allowed me to actually visualize the theorem.
Then I watched StatQuest and understood the machine learning and stats side of it, and after that I wrote the code shown below!
This was an amazing task, as it allowed me to see the blend of maths, geometry, and machine learning.
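For reference, a minimal sketch of the same idea with scikit-learn's GaussianNB on the iris dataset (just an illustration, not the notebook code below):

```python
# Bayes' theorem with a "naive" independence assumption between features.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB()            # assumes each feature is Gaussian within each class
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```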
Task 3-4: Ensemble + XGBoost, GBM, Random Forest
I have executed all the ensemble techniques mentioned in the same notebook!
There are 3 main ensemble techniques:
- Bagging: train many models on bootstrapped samples of the data and average (or vote on) their predictions
- Stacking: feed the predictions of several different base models into a final meta-model that learns how to combine them
- Boosting: train models one after another, with each new model focusing on the mistakes of the previous ones
I would highly recommend watching the StatQuest video on XGBoost.
I first cleaned the data (this is the Titanic dataset), then created objects of the model classes and used the models as shown below:
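As a rough sketch of that workflow (the full version with all the cleaning is in the notebook; the file name "titanic.csv" and the exact columns used here are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Basic cleaning: fill missing values and encode 'Sex' as a number.
df = pd.read_csv("titanic.csv")
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
X = df[["Pclass", "Sex", "Age", "Fare"]]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "bagging (random forest)": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosting (GBM)": GradientBoostingClassifier(random_state=0),
    "stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("gbm", GradientBoostingClassifier(random_state=0))],
        final_estimator=LogisticRegression(),
    ),
}
# xgboost.XGBClassifier from the xgboost package drops into the same loop.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```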
Task 5: Hyperparameters
Anyone can make a model that works, but tuning the model to work efficiently, and knowing what to do when there is less data, is where the maths comes into play.
Hyperparameters allow the model to be tuned to our own needs: whether we want it to run fast or to give better results depends on the users and use cases, and the stakeholders who will be using the model need to make sure it behaves the way they need.
I learnt about these in much more detail from my Coursera courses.
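A minimal sketch of what that tuning looks like in practice, using scikit-learn's GridSearchCV with cross-validation (the grid values are just illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)
param_grid = {
    "n_estimators": [100, 300],      # more trees: usually better, but slower
    "max_depth": [3, 6, None],       # shallower trees: faster, less overfitting
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```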
Task 6 : K-Means Clustering
This task is one of the cooler ones and lets us use really simple maths to make amazing programs. I was on a plane when I wrote the first version of the code that does K-means clustering for points, and I am really proud of that one. Later I also ran it on the MNIST dataset, which is gold as it's easily accessible on all platforms.
Later on I used another version of this dataset from TensorFlow. It's a really cool dataset, and it was really nice to understand the overall maths behind this.
Some other resources I watched for this: the best out there would be StatQuest, which teaches the maths very nicely and intuitively. Even the documentation for this task is pretty good!
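For reference, here is a small sketch of the K-means loop itself, simplified with NumPy (not the exact code I wrote on the plane): assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat.

```python
import numpy as np

def kmeans(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as centroids.
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iterations):
        # Distance of every point to every centroid, then pick the closest one.
        labels = np.argmin(np.linalg.norm(points[:, None] - centroids[None, :], axis=2), axis=1)
        # Move each centroid to the mean of its assigned points (keep it if empty).
        new_centroids = np.array([points[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

points = np.array([[1, 1], [1.2, 0.8], [5, 5], [5.1, 4.9], [9, 1], [8.8, 1.2]])
centroids, labels = kmeans(points, k=3)
print(centroids)
print(labels)
```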
Task 7 : Anomaly Detection
At first I thought this would be a hard task, but it turned out to be one of the easiest. We basically take the algorithms we have learnt so far and tweak them a bit so they detect anomalies. I have added a lot of visualization to see how these algorithms work, and I tried to explain in simple words in the code how and why each one works.
There are 5 algorithms in the notebook (a quick scikit-learn sketch follows the list):
- Local Outlier Factor: works on the basis of local density, using a KNN-like neighbour distance
- Isolation Forest: works on the basis of decision trees and assigns a score; anomalies are easier to isolate, so they get lower scores
- One-class SVM: learns a decision boundary around the normal data and classifies anything outside it as an outlier
- Elliptic Envelope: assumes the data is roughly Gaussian and fits an ellipse around it; the majority fits inside, and whatever doesn't is considered an outlier
- K-Means: makes clusters based on k; once the final clusters are fixed, points farther than some threshold from their centroid are considered outliers
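Here is a quick sketch of four of these detectors with scikit-learn on toy data (the contamination values are just illustrative; the notebook has the full versions with visualizations):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(200, 2))          # dense blob of "normal" points
outliers = rng.uniform(-6, 6, size=(10, 2))       # scattered anomalies
X = np.vstack([normal, outliers])

detectors = {
    "Isolation Forest": IsolationForest(contamination=0.05, random_state=0),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "One-class SVM": OneClassSVM(nu=0.05),
    "Elliptic Envelope": EllipticEnvelope(contamination=0.05),
}
for name, det in detectors.items():
    labels = det.fit_predict(X)                   # +1 = inlier, -1 = outlier
    print(name, "flagged", (labels == -1).sum(), "points")
```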
Task 8 : GAN
Understanding the theory:
This is part of deep learning: we basically have 2 models in a training loop that learn from each other. As with any form of art that exists, there must exist a critic; without the critic the art will not flourish, never improve, never reach greater heights.
Similarly, the idea of a GAN is to have 2 models, one that is the artist (Generator) and one that is the critic (Discriminator). They both learn from each other and try to get better at beating the other.
The artist here is like a con artist trying to sell fake art to a critic who wants real, authentic art. The generator is rewarded every time it fools the discriminator, and the discriminator is rewarded every time it guesses correctly.
We use Dropout layers to make sure the model learns slowly, or else it may tend to overfit easily.
My Coursera course had actually taught me about this, but I learnt it in a lot more detail using a tutorial I found by Nicholas Renotte.
Also, I had explored this topic before for level 1, so here it is:
This is an interactive read on CNNs, where I have shown how the inputs are taken and how the filters, which are normally hidden inside, are used.
This is one of the three videos, showing edge filtering basics.
This is a small run on Kaggle; sadly, I exhausted my whole GPU quota.
Then I used Google Colab, as it let me run it on a GPU. I even tried the CelebA dataset, but that took horribly long, so I ended up using MNIST since it's easy to load, and ran it on Jupyter. On Colab I stored all the images to Drive for easy access.
More about the model: I use the Adam optimizer and Binary Cross-Entropy as the loss function for both the generator and the discriminator.
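A trimmed sketch of that kind of setup in TensorFlow/Keras for MNIST-sized images (the layer sizes here are illustrative; the fully commented model is in the notebook):

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 128

def build_generator():
    # The "artist": turns random noise into a 28x28 image.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(LATENT_DIM,)),
        layers.Dense(7 * 7 * 128),
        layers.LeakyReLU(0.2),
        layers.Reshape((7, 7, 128)),
        layers.Conv2DTranspose(128, 4, strides=2, padding="same"),  # 14x14
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(64, 4, strides=2, padding="same"),   # 28x28
        layers.LeakyReLU(0.2),
        layers.Conv2D(1, 7, padding="same", activation="sigmoid"),
    ])

def build_discriminator():
    # The "critic": says how real an image looks; Dropout keeps it from
    # learning too fast and overpowering the generator.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(64, 5),
        layers.LeakyReLU(0.2),
        layers.Dropout(0.4),
        layers.Conv2D(128, 5),
        layers.LeakyReLU(0.2),
        layers.Dropout(0.4),
        layers.Flatten(),
        layers.Dense(1, activation="sigmoid"),
    ])

generator, discriminator = build_generator(), build_discriminator()
bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-4)   # Adam + BCE for both models, as above
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_images):
    # real_images: a batch of MNIST digits scaled to [0, 1], shape (batch, 28, 28, 1).
    noise = tf.random.normal((tf.shape(real_images)[0], LATENT_DIM))
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(noise, training=True)
        real_pred = discriminator(real_images, training=True)
        fake_pred = discriminator(fake_images, training=True)
        # The critic is rewarded for telling real (1) from fake (0);
        # the artist is rewarded for making the critic say 1 on fakes.
        d_loss = bce(tf.ones_like(real_pred), real_pred) + bce(tf.zeros_like(fake_pred), fake_pred)
        g_loss = bce(tf.ones_like(fake_pred), fake_pred)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss
```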
A helpful tip for anyone who wants to do deep learning: setting up CUDA and all the dependencies with the right versions is a huge pain, so be wary of that.
I shall show the images from epoch 0 through to the later epochs.
This contains the 300+ images that I got over 100 epochs here.
I have tried to explain the model in depth and how it works in the code as comments!
Here are some of the images after 100 epochs, going from epoch 0 to 99. I am really happy with this code!
:D
Task 9 : LangChain
This explains the basics of NLP:
This is the textbook that I used because CLRS wasn't working for me :( I guess I must read that textbook, it's pretty important for placements.
Link to the book on differentiation by the Open University
The theory
LangChain allows us to use different LLMs easily and connect them all for different purposes. It's a giant framework that lets us build chains (feed the output of one LLM as input to another), agents, and memory management.
Now that you have a brief understanding of LangChain: in our code we basically use pypdf, which can parse through PDFs and give them back to us as a list. We do that to our maths textbook and get the textbook back as a list of pages.
Now, as we would like to build a RAG (Retrieval-Augmented Generation) system, we would like to use this textbook to get our answers. To do that, we first split the data we have into smaller chunks.
This is done via a vectorstore; the chunks can then be searched based on the user's query. Usually we would use OpenAI's embeddings, but due to quota limits I used a custom embedding, which allowed the model to work. This data is fetched for us by a Retriever. We then build the RAG chain: the model works based on the system prompt, retrieves the relevant chunks, and we print the answer it gives!
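A hedged sketch of that pipeline is below; the class and package names follow the langchain-community layout and may differ between LangChain versions, and the PDF path, the local embedding model, and the chat model here are stand-ins rather than the exact ones I used:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI

# 1. Parse the PDF into a list of page documents (this is the pypdf step).
pages = PyPDFLoader("maths_textbook.pdf").load()

# 2. Break the pages into smaller overlapping chunks so retrieval stays precise.
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(pages)

# 3. Embed the chunks and store them in a vector store (a local embedding model
#    here instead of OpenAI's, mirroring the quota workaround).
vectorstore = FAISS.from_documents(chunks, HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"))
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 4. Retrieve the chunks relevant to a question and hand them to the LLM
#    together with a system-style instruction.
question = "What is the derivative of sin(x)?"
context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))
llm = ChatOpenAI(model="gpt-4o-mini")
answer = llm.invoke(
    f"Answer using only the following textbook excerpts:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```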
Task 10 : PaddleOCR
OCR: Optical Character Recognition
For this task I at first tried to use my 2nd-sem report, but that failed miserably: some of the data was too close together and got mixed up, as the spaces weren't really clear. Then I made a new sheet with some of my family members and their ages.
This worked, but the numbers in Excel go off to one side, which again meant no space was detected. So I resorted to storing the ages as text and then converting them to integers using a mapping function.
Also, the output of PaddleOCR is a list that contains the text, score, and location. I have explained that structure in the code below. The boxes are the location; sorry for the bad labelling.
I plotted the scores on a graph, calculated the average score, and plotted it as a line. Then, after mapping the data and getting numerical ages, I plotted a graph of everyone's age.
The code worked and this was an amazing task; I'm shocked at how good this library is, and at its accuracy and ease of implementation.
PaddleOCR supports multiple languages; the one I used is English.
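A short sketch of how that output can be read (the image path is a stand-in; the result layout shown follows the classic ocr() API, which returns one [box, (text, score)] entry per detected line and may differ in newer PaddleOCR versions):

```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en")                      # English model, as mentioned above
result = ocr.ocr("family_ages.png")

texts, scores = [], []
for box, (text, score) in result[0]:            # result[0] = lines of the first image
    texts.append(text)
    scores.append(score)
    print(f"{text!r}  score={score:.2f}  box={box}")

# Ages came through as text, so map them back to integers before plotting.
ages = list(map(int, [t for t in texts if t.isdigit()]))
print("average score:", sum(scores) / len(scores))
print("ages:", ages)
```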