This Report is yet to be approved by a Coordinator.

Marvel level 2

24 / 4 / 2024

Documentation-Marvel-Level-3

Task 1 - Decision Tree based ID3 Algorithm

To predict the quality of wine, I used Decision Tree using the ID3 algorithm. I built the ID3 algorithm from scratch which involved calculating entropy, information gain, and weighted average entropy to identify the optimal feature for each split. Features like acidity and alcohol content, were important predictors of wine quality.
Code

Task 2: Naive Bayesian Classifier

Built a Naive Bayesian Classifier that works on BBC's data and categorizers texts into entertainment, tech, business, sport, etc. The categorical data is converted into numerical data such that it can be interpreted by the machine. This is done by analyzing repeated words in each category. The categorical data is converted into numerical data.
Code

Task 3 - Ensemble techniques

Applied Ensemble Techniques to the titanic dataset. To this dataset, I incooperated all three ensemble techniques. Ensemble learning refers to algorithms that combine the predictions from two or more models. Combining models like Random Forest, Decision Trees, and Gradient Boosting increased the accuracy of the model.
Code

Task 4 - Random Forest, GBM and Xgboost

1. Random Forest

Used a random foreset classifier to predict if a patient is with heart disease. Random Forest Classifiers are a collection of individual decision trees. More is uncorrelation between the Decision trees , more is the accuracy of the Random Forest classifier.
Code

2. GBM

Used Gradient Boosting Classifier to predict if a patient is with breast cancer or not. In GBM, the week learning models combine with the stronger learning models. Boosting is one kind of ensemble Learning method which trains the model sequentially and each new model tries to correct the previous model.
Code

3. XGBoost

XGBoost is also an Ensemble learning method, that stands for Extreme Gradient Boosting. Here, I've used XGBoost to predict the if a person would return the loan he/she has taken from the bank.
Code

Task 5 - Hyperparameter Tuning

Used Hyperparameter tuning to increase the accurcy from 81% to 92% of Student performance dataset. I first create a parameter grid that tries different values of each parameters such as max_depth, min_samples_split, min_samples_leaf, criterion and find the fest values for each parameter.
Code

Before Hyperparameter Tuning

After Hyperparameter Tuning

Task 6 : Image Classification using KMeans Clustering

K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into a pre-defined number of clusters. This first step for this task was to find the 'k' value and initialize the centroids. Next, by converting the images into numerical data and applying clustering, I classified image segments.
Code

Task 7: Anomaly Detection

I learned about anomaly detection techniques and implemented them on the load_linnerud dataset(toy dataset), which contains physiological and exercise data. I applied the Local Outlier Factor (LOF) algorithm to identify outliers in the dataset. In this datset the outliers would be unusual physiological responses or extreme exercise measurements.
Code

Task 8: Generative AI Task Using GAN

In GAN two models are trained simultaneously. These two models are Generator and discriminator. The generator is the artist that learns to create images that look real, while a discriminator is the art critic that learns to tell real images apart from fakes. During training, the generator progressively becomes better at creating images that look real, while the discriminator becomes better at telling them apart. The process reaches equilibrium when the discriminator can no longer distinguish real images from fakes.
Code

Task 9: PDF Query Using LangChain

LangChain is a framework that allows developers to create agents capable of reasoning about issues and breaking them down into smaller sub-tasks. I used libraries such as HuggingFaceEmbeddings which uses sentence-transformers to generate embeddings for the text chunks, and RetrievalQA Chain which combines retrieval with the LLM to provide answers to your questions.
Code

Task 10: Table Analysis Using PaddleOCR

PaddleOCR is an open-source Optical Character Recognition (OCR) tool developed by PaddlePaddle, a deep learning platform from Baidu. It's a powerful library that supports text detection, recognition, and even layout analysis for a wide range of document types.
Code

Varsha's AI-ML-001 course work. Lv 3