cover photo

COURSEWORK

Vaishnav's AI-ML-001 course work. Lv 3

Vaishnav SAUTHORACTIVE
This Report is yet to be approved by a Coordinator.

Level 2 report

31 / 3 / 2025


Marvel Level 3 Report

Task 1: Decision Tree Based on ID3 Algorithm

A decision tree is a tree-like structure used to make decisions or predictions based on given conditions. It is a machine learning algorithm that can be used for both classification and regression tasks.

Key Learnings:

  • Terminologies: Root node, internal nodes, branches, leaf nodes, tree height.
  • Reducing impurities: Techniques like Gini Impurity, Information Gain, Entropy, and Mean Squared Error (MSE).
  • Overfitting problem: A very deep tree can lead to overfitting, reducing prediction accuracy.
  • Pruning: A method to remove less important branches to improve generalization.
  • Splitting criteria: Algorithms such as ID3, C4.5, and CART help decide the optimal order of features.

ID3 Algorithm:

  • ID3 (Iterative Dichotomiser 3) uses entropy and information gain to build a decision tree.

  • At each branching stage, the split with the highest information gain is selected.

  • Notes IG.jpg entropy.jpg

Code Implementation: GitHub Repository


Task 2: Naïve Bayesian Classifier

The Naïve Bayes classifier is a probabilistic machine learning model used for classification tasks. It is based on Bayes' Theorem and assumes that all features are independent, which is often an unrealistic assumption, hence the term "Naïve."

Key Learnings:

  • Despite its assumptions, Naïve Bayes is highly effective for tasks like text classification and spam detection.
  • Types of Naïve Bayes:
    • Gaussian Naïve Bayes (for continuous data)
    • Multinomial Naïve Bayes (for text classification)
    • Bernoulli Naïve Bayes (for binary features)

nb1.png Code Implementation: Naïve Bayes from scratch


Task 3: Anomaly Detection

Anomaly detection involves identifying deviations from expected or normal data patterns.

Types of Anomalies:

  1. Point Anomaly: A single data point deviates significantly from the rest.
  2. Contextual Anomaly: An anomaly that depends on the context (e.g., seasonal sales spikes).
  3. Collective Anomaly: A group of points that together form an anomaly.

Key Learnings:

  • Used KNN clustering to find anomalies in the data.
  • Plotted contour plots to visualize anomalies. anomaly2.png
  • Implemented different algorithms to detect anomalies. anomaly1.png

-Code Implementation: Finding Anomalie

-Code Implementation: Implementing anomaly detecting algorithms


Task 4: Ensemble Techniques

Ensemble learning is a technique that combines multiple models to improve predictive performance by reducing errors, overfitting, and variance.

Types of Ensemble Techniques:

  1. Bagging (Bootstrap Aggregating)

    • Multiple models (bags) are trained on different subsets of data.
    • The average prediction is used for final decision-making.
  2. Boosting (AdaBoost)

    • Incorrectly classified instances are given higher weights in the next model iteration.
    • Helps improve weak classifiers iteratively.
  3. Stacking

    • Combines predictions from multiple models into a new model for improved accuracy.
  4. Voting & Averaging

    • Uses multiple models and aggregates their predictions.
    • Hard Voting: The class predicted by the majority is chosen.
    • Soft Voting: Averages the predicted probabilities for classification.

-Code Implementation: Bagging, Voting and Boosting

-Code Implementation: Stacking


Task 5: KMeans Clustering

KMeans is an unsupervised learning algorithm used for clustering data into K groups.

Steps:

  • Choose K and initialize centroids.
  • Assign data points to nearest centroid.
  • Recalculate centroids.
  • Repeat until convergence.

Screenshot-107.png

-Code Implementation: Code


Task 6: Generative AI Task Using GAN

GAN (Generative Adversarial Network) is a model with two networks:

  • Generator: creates fake data
  • Discriminator: distinguishes between real and fake data

They compete against each other until the discriminator can’t tell real from fake.

-Code Implementation: Code


Task 7: PDF Query Using LangChain

LangChain is an open-source framework for building apps using LLMs (Large Language Models).

I used it to parse a budget PDF file and query specific information using:

  • PDF parsing techniques
  • LLM-powered question answering

-Code Implementation: Code


Task 8: PaddleOCR

PaddleOCR is an open-source Optical Character Recognition tool that supports:

  • Text detection & recognition
  • Multiple languages
  • Lightweight and fast integration

I used it to extract text from a photo of a bill.

-Code Implementation: Code


Task 9: Random Forest, GBM, and XGBoost

Random Forest

  • A bagging ensemble of decision trees.
  • Uses majority voting for prediction.

rf.jpg

GBM (Gradient Boosting Machine)

  • A boosting technique where each tree learns to correct errors from the previous tree.
  • Uses loss minimization, often with MSE. boost.jpg gb.jpg

XGBoost

  • An optimized version of GBM.
  • Includes:
    • Regularization
    • Parallel processing
    • Missing value handling
    • Tree pruning

-Code Implementation: Code


##Task 10: Hyperparameter Tuning

Hyperparameter tuning is the process of selecting the best set of hyperparameters (not learned from data) for model training.

Techniques:

  • GridSearchCV
  • RandomSearchCV
  • Bayesian Optimization

-Code Implementation: Code


UVCE,
K. R Circle,
Bengaluru 01