
AI-ML Level 3 Report

Monika D

5/1/2025


Task 1: Decision Tree-Based ID3 Algorithm

Decision Tree

A decision tree is a tree-like model used for decision-making and classification. Each internal node represents a feature, each branch a decision rule, and each leaf an outcome.

ID3 (Iterative Dichotomiser 3) Algorithm

The ID3 algorithm is a popular method for building a decision tree based on the concepts of entropy and information gain.

Steps of the ID3 Algorithm

  1. Calculate the Entropy of the Dataset: Compute the overall entropy of the target variable, Entropy(S) = −Σᵢ pᵢ log₂ pᵢ, where pᵢ is the proportion of examples in class i.
  2. Evaluate Information Gain for Each Feature: For each feature A, calculate the information gain from splitting on its values, Gain(S, A) = Entropy(S) − Σᵥ (|Sᵥ|/|S|) · Entropy(Sᵥ).
  3. Select the Best Feature: Choose the feature with the highest information gain as the root node.
  4. Split the Dataset: Partition the dataset into subsets based on the chosen feature.
  5. Repeat Recursively: Apply the above steps to each subset until all features are used or the target variable is perfectly classified (entropy = 0).
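
A minimal Python sketch of these steps, using a tiny made-up weather table (the data and column names are illustrative, not from the coursework):

```python
import numpy as np
import pandas as pd

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions.
    probs = labels.value_counts(normalize=True)
    return -(probs * np.log2(probs)).sum()

def information_gain(df, feature, target):
    # Gain = Entropy(S) minus the weighted entropy of each subset after splitting.
    weighted = sum(
        len(subset) / len(df) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return entropy(df[target]) - weighted

def id3(df, target, features):
    labels = df[target]
    if labels.nunique() == 1:        # entropy 0: pure node becomes a leaf
        return labels.iloc[0]
    if not features:                 # no features left: majority class leaf
        return labels.mode()[0]
    best = max(features, key=lambda f: information_gain(df, f, target))
    tree = {best: {}}
    for value, subset in df.groupby(best):
        remaining = [f for f in features if f != best]
        tree[best][value] = id3(subset, target, remaining)
    return tree

# Toy weather dataset (hypothetical values for illustration).
data = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"],
    "Windy":   ["No", "Yes", "No", "No", "Yes", "Yes"],
    "Play":    ["No", "No", "Yes", "Yes", "No", "Yes"],
})
print(id3(data, "Play", ["Outlook", "Windy"]))
```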



Task 2: Naive Bayesian Classifier

Overview

Bayes' theorem gives the probability of a hypothesis H given evidence E: P(H | E) = P(E | H) · P(H) / P(E).

Naive Bayes Algorithm

Naive Bayes is a probabilistic classifier that applies Bayes' theorem under the "naive" assumption that features are conditionally independent given the class. It is commonly used for classification tasks such as text categorization.

Example: Spam Mail Classification

  • Naive Bayes analyzes each word in an email (e.g., "free," "win") and estimates whether it is spam based on how frequently those words appear in spam emails.
  • It treats each word independently, without considering their relationships, to make predictions.
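
A minimal scikit-learn sketch of this idea; the six example emails and their labels below are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set for illustration.
emails = [
    "win a free prize now", "free money offer", "claim your free gift",
    "meeting agenda for monday", "project status update", "lunch tomorrow?",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Bag-of-words counts feed a multinomial Naive Bayes classifier,
# which treats each word count as conditionally independent given the class.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["free prize meeting"]))        # predicted class
print(model.predict_proba(["free prize meeting"]))  # class probabilities
```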



Task 3: Ensemble Techniques

Ensemble learning is a machine learning technique that improves the accuracy and robustness of predictions by combining multiple models.

Types of Ensemble Techniques

1. Bagging

  • Description: Trains multiple models independently on random subsets of the data and combines their predictions to reduce overfitting.
  • Key Feature: Lowers variance by averaging predictions (regression) or majority voting (classification).

2. Boosting

  • Description: Trains models sequentially, with each model improving on the errors of the previous one.
  • Key Feature: Combines all models' outputs to reduce bias and improve accuracy.

3. Stacking

  • Description: Combines predictions from multiple diverse models using a meta-model to improve performance.
  • Key Feature: Allows different algorithms to contribute their strengths for better accuracy.

4. Blending

  • Description: Uses a held-out validation set to combine predictions from multiple models with simple techniques such as averaging.
  • Key Feature: A simpler version of stacking, though usually somewhat less powerful.
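
A short scikit-learn sketch comparing bagging, boosting, and stacking on synthetic data (the model choices and parameters below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)  # synthetic dataset

models = {
    # Bagging: independent trees on bootstrap samples, combined by majority vote.
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    # Boosting: trees fit sequentially, each correcting the previous trees' errors.
    "boosting": GradientBoostingClassifier(n_estimators=50),
    # Stacking: a logistic-regression meta-model combines diverse base models.
    "stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier()),
                    ("gb", GradientBoostingClassifier())],
        final_estimator=LogisticRegression(),
    ),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```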



Task 7: Anomaly Detection

Anomaly detection identifies unusual data points or patterns in a dataset and has applications in various fields such as finance, healthcare, and cybersecurity.

Anomaly Detection Algorithms

1. Isolation Forest

  • Description: Isolates anomalies by building random decision trees and flagging points with shorter average path lengths as outliers.
  • Key Feature: Exploits the fact that anomalies are few and different, so they are easier to isolate than normal points.

2. Local Outlier Factor (LOF)

  • Description: Detects anomalies by comparing the local density of a point to its neighbors.
  • Key Feature: Flags points with significantly lower density as outliers.

3. One-Class SVM

  • Description: Learns a boundary in feature space that encloses the normal data.
  • Key Feature: Flags points falling outside this boundary as outliers.
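
A compact scikit-learn sketch running all three detectors on synthetic 2-D data; the 5% contamination rate is an assumed setting:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
normal = rng.normal(0, 1, size=(200, 2))     # dense "normal" cluster
outliers = rng.uniform(-6, 6, size=(10, 2))  # scattered anomalies
X = np.vstack([normal, outliers])

# Each detector returns +1 for inliers and -1 for outliers.
print(IsolationForest(contamination=0.05, random_state=0).fit_predict(X))
print(LocalOutlierFactor(n_neighbors=20, contamination=0.05).fit_predict(X))
print(OneClassSVM(nu=0.05, kernel="rbf").fit(X).predict(X))
```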



Task 4: Random Forest, GBM, and XGBoost

Random Forest

Random Forest builds multiple decision trees independently and combines their results (majority vote for classification, average for regression). It helps reduce overfitting and works well on large datasets.

Gradient Boosting (GBM)

GBM builds decision trees sequentially, where each tree corrects the mistakes of the previous ones. It often achieves higher accuracy than Random Forest but can be slower to train and more prone to overfitting.

XGBoost

XGBoost is an optimized version of GBM, making it much faster and more efficient. It uses advanced techniques like tree pruning and parallel computation to improve speed and accuracy.
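
A brief comparison sketch on synthetic data; it assumes the third-party xgboost package is installed alongside scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # assumes the xgboost package is installed

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for model in (
    RandomForestClassifier(n_estimators=100),      # independent trees, majority vote
    GradientBoostingClassifier(n_estimators=100),  # sequential error-correcting trees
    XGBClassifier(n_estimators=100),               # optimized gradient boosting
):
    model.fit(X_train, y_train)
    print(type(model).__name__, accuracy_score(y_test, model.predict(X_test)))
```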



Task 5: Hyperparameter Tuning

Hyperparameters

Hyperparameters are settings that we choose before training a machine learning model. They control how the model learns and affect its performance.

Implementation Steps:

  1. Load the Titanic dataset.
  2. Train a baseline model – fit a Random Forest with default settings.
  3. Evaluate baseline accuracy – measure initial performance.
  4. Define a hyperparameter grid – list the different values to test.
  5. Run RandomizedSearchCV or GridSearchCV – search for the best hyperparameter combination.
  6. Train the best model – retrain using the optimized hyperparameters.

Hyperparameter tuning increases accuracy by selecting the best settings, making the model better at predicting survival on the Titanic dataset.
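
A sketch of these steps, assuming seaborn's public copy of the Titanic dataset (downloaded on first use) and a small illustrative parameter grid:

```python
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# 1. Load the Titanic dataset and do minimal preprocessing.
titanic = sns.load_dataset("titanic")
df = titanic[["pclass", "sex", "age", "fare", "survived"]].dropna()
X = pd.get_dummies(df.drop(columns="survived"), drop_first=True)
y = df["survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 2-3. Baseline model with default settings.
baseline = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))

# 4. Hyperparameter grid to sample from (illustrative values).
params = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
}

# 5. Randomized search over the grid with 5-fold cross-validation.
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions=params, n_iter=10, cv=5,
                            random_state=42)
search.fit(X_train, y_train)

# 6. Evaluate the model retrained with the best hyperparameters.
print("best params:", search.best_params_)
print("tuned accuracy:",
      accuracy_score(y_test, search.best_estimator_.predict(X_test)))
```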



Task 6: Image Classification Using K-Means Clustering

K-Means Clustering

K-Means is an unsupervised machine learning algorithm used for clustering data into K distinct groups based on similarity. It minimizes the variance within clusters by iteratively updating cluster centers (centroids) and reassigning points.

Key Idea:

  • Each data point is assigned to the closest cluster center.
  • The cluster centers are updated iteratively to minimize intra-cluster variance.
  • The algorithm stops when clusters no longer change significantly.

How It Works for Image Classification:

  1. Extract features from images (e.g., pixel intensities, edge features, deep learning embeddings).
  2. Apply K-Means clustering to group similar images or pixels into categories.
  3. Assign cluster labels to classify images.
  4. Evaluate accuracy by comparing with true labels (if available).
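
A minimal sketch of this workflow using scikit-learn's built-in 8×8 digit images in place of a custom image set; cluster labels are mapped to digits by majority vote before scoring:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score

digits = load_digits()   # 8x8 grayscale digit images
X = digits.data          # 1. each image flattened to 64 pixel features

# 2. Cluster the images into K = 10 groups.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)

# 3. Clusters are unlabeled, so map each cluster to its most common true digit.
labels = np.zeros_like(clusters)
for c in range(10):
    mask = clusters == c
    labels[mask] = np.bincount(digits.target[mask]).argmax()

# 4. Compare the cluster-derived labels with the true labels.
print("accuracy vs. true labels:", accuracy_score(digits.target, labels))
```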



Task 9: PDF Query Using LangChain

PDF Query using LangChain lets users extract relevant information from PDFs through natural-language queries. It processes a PDF by extracting its text, converting the text into vector embeddings, storing them in FAISS, and retrieving the most relevant sections for each user query, which makes document search more efficient.

Steps Followed:

  1. Upload PDF File using files.upload() in Google Colab.
  2. Extract text from PDF using PyPDFLoader.
  3. Load sentence-transformer model (all-MiniLM-L6-v2) for text embeddings.
  4. Split text into chunks (500 characters with 50-character overlap) using RecursiveCharacterTextSplitter.
  5. Convert text into vector embeddings using the HuggingFace model.
  6. Store embeddings in FAISS for efficient retrieval.
  7. Process user query (e.g., “What is the main topic of this document?”).
  8. Retrieve the top 3 most relevant text chunks using FAISS similarity search.
  9. Display the retrieved text sections as query responses.
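
A condensed sketch of this pipeline. LangChain import paths vary by version (the langchain_community layout is assumed here), and "document.pdf" is a placeholder for the uploaded file:

```python
# In Colab, the file would first be uploaded with files.upload();
# "document.pdf" below is a placeholder path.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

docs = PyPDFLoader("document.pdf").load()                  # 2. extract text
splitter = RecursiveCharacterTextSplitter(chunk_size=500,  # 4. 500-char chunks
                                          chunk_overlap=50)  #  with 50-char overlap
chunks = splitter.split_documents(docs)

embeddings = HuggingFaceEmbeddings(                        # 3, 5. embed the chunks
    model_name="sentence-transformers/all-MiniLM-L6-v2")
store = FAISS.from_documents(chunks, embeddings)           # 6. index in FAISS

query = "What is the main topic of this document?"         # 7. user query
for doc in store.similarity_search(query, k=3):            # 8. top 3 chunks
    print(doc.page_content, "\n---")                       # 9. display results
```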



Task 10: Table Analysis Using PaddleOCR

PaddleOCR (POCR)

PaddleOCR is an open-source Optical Character Recognition (OCR) toolkit. It is designed for text detection and recognition in images and scanned documents, supporting multiple languages and handwritten or printed text.


Steps:

  1. Load Image – Reads the image file using OpenCV (cv2.imread).
  2. Extract Text – Uses PaddleOCR to recognize and extract text from the image.
  3. Process OCR Output – Organizes extracted text into a pandas DataFrame, converting numerical fields properly.
  4. Analyze Data – Prints the extracted table and basic statistics.
  5. Visualize Data – Plots a bar chart comparing order amounts by date using Matplotlib.
  6. Execute Pipeline – Calls all functions in main(), running the full process.
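
A sketch of steps 1-4, assuming PaddleOCR 2.x's output format (a list of (bounding box, (text, confidence)) entries per page); "table.png" is a placeholder path, and the resulting DataFrame can then be plotted with Matplotlib as in step 5:

```python
import cv2
import pandas as pd
from paddleocr import PaddleOCR

IMAGE_PATH = "table.png"  # placeholder path to the table image

ocr = PaddleOCR(use_angle_cls=True, lang="en")  # load detection + recognition models
image = cv2.imread(IMAGE_PATH)                  # 1. read the image with OpenCV
result = ocr.ocr(image, cls=True)               # 2. run OCR on the image

# 3. Organize each (bounding box, (text, confidence)) entry into a DataFrame.
rows = [{"text": text, "confidence": conf}
        for box, (text, conf) in result[0]]
df = pd.DataFrame(rows)

print(df)             # 4. extracted table text
print(df.describe())  # basic statistics for the numeric columns
```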



Task 8: Generative AI Task Using GAN

This task uses a Generative Adversarial Network (GAN) to create realistic images from the CIFAR-10 dataset.

Steps:

  • Import & Setup: Load libraries, set device, define hyperparameters.
  • Data: Load CIFAR-10, normalize to [-1, 1], create DataLoader.
  • Models:
    • Generator: Upsample latent vector using ConvTranspose2d.
    • Discriminator: Downsample image using Conv2d, classify real/fake.
  • Training:
    • Use BCELoss.
    • Separate Adam optimizers for G and D.
    • Alternate training D and G each batch.
  • Visualize: Generate and display fake images after each epoch.
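
A condensed PyTorch sketch of this setup; the latent size and layer widths are illustrative DCGAN-style choices, and the DataLoader loop is reduced to a single training step:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
latent_dim = 100  # assumed hyperparameter

# Generator: upsample a latent vector to a 3x32x32 image with ConvTranspose2d.
G = nn.Sequential(
    nn.ConvTranspose2d(latent_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
    nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),  # output in [-1, 1]
).to(device)

# Discriminator: downsample an image with Conv2d to a real/fake probability.
D = nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(256, 1, 4, 1, 0), nn.Sigmoid(), nn.Flatten(0),  # one score per image
).to(device)

criterion = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))

def train_step(real):
    # real: one DataLoader batch of CIFAR-10 images normalized to [-1, 1].
    b = real.size(0)
    real = real.to(device)
    ones = torch.ones(b, device=device)
    zeros = torch.zeros(b, device=device)

    # Train D: push real images toward 1 and generated images toward 0.
    z = torch.randn(b, latent_dim, 1, 1, device=device)
    fake = G(z)
    loss_d = criterion(D(real), ones) + criterion(D(fake.detach()), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Train G: fool D into predicting 1 for generated images.
    loss_g = criterion(D(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```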

