
COURSEWORK

Shreya's AI-ML-001 coursework. Lv 3

Shreya Nayak

LEVEL 2 - AIML

1 / 4 / 2025


Task 1 - Decision Tree-based ID3 Algorithm

The Decision Tree-based ID3 (Iterative Dichotomiser 3) algorithm is a supervised learning algorithm that recursively splits data using entropy and information gain to construct a tree for classification.

Steps to Implement:

  1. Import Libraries and Load the Dataset
  2. Preprocess Data (Encoding Categorical Variables)
  3. Define Entropy Calculation
  4. Implement Dataset Splitting Function
  5. Compute Best Attribute for Splitting (Information Gain)
  6. Build the Decision Tree Using Recursion
  7. Train the Model on Processed Data
  8. Convert and Print Decision Tree in Readable Format
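
As a rough illustration of steps 3-5, here is a minimal sketch of the entropy and information-gain computations, assuming a pandas DataFrame with a categorical target column (all column and attribute names are illustrative; the full implementation is linked below):

import numpy as np
import pandas as pd

def entropy(labels):
    # Shannon entropy of a label column: -sum(p * log2(p))
    probs = labels.value_counts(normalize=True)
    return -np.sum(probs * np.log2(probs))

def information_gain(df, attribute, target='target'):
    # Entropy of the parent node minus the weighted entropy of
    # the child nodes produced by splitting on `attribute`
    parent = entropy(df[target])
    weighted = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(attribute)
    )
    return parent - weighted

df = pd.DataFrame({
    'outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain'],
    'target':  ['no',    'no',    'yes',      'yes',  'no'],
})
# At each node, ID3 splits on the attribute with the highest gain (step 5)
print(information_gain(df, 'outlook'))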

Code Implementation:

View the code on GitHub


Task 2 - Naive Bayesian Classifier

The Naive Bayesian classifier is a probabilistic algorithm based on Bayes' Theorem that assumes independence among predictors.

Steps to Implement:

  1. Import libraries and load the dataset.
  2. Preprocess data (cleaning, encoding, etc.).
  3. Split data into train and test sets.
  4. Train a Naive Bayes model (GaussianNB, MultinomialNB, etc.).
  5. Evaluate model performance (accuracy, F1-score, etc.).
  6. Test on unseen data.

Types of Naive Bayesian Classifiers

  1. Gaussian Naive Bayes: Assumes features follow a normal distribution; used for continuous data.
  2. Multinomial Naive Bayes: Suitable for discrete data like word counts in text classification.
  3. Bernoulli Naive Bayes: Works with binary/boolean features (e.g., spam detection).
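
A minimal sketch of steps 3-5 using the Gaussian variant in scikit-learn, with the built-in Iris dataset as a stand-in for the actual data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GaussianNB()          # Gaussian variant: continuous features
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred))
print("F1 (macro):", f1_score(y_test, pred, average="macro"))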

Code Implementation:

View the code on GitHub


Task 3 - Ensemble Techniques

Ensemble techniques in machine learning combine multiple models to achieve better performance than any individual model alone. By aggregating the predictions of several base models, ensemble methods reduce variance or bias and improve prediction accuracy. They are particularly effective on complex datasets and make models more robust. Common ensemble methods include Bagging (e.g., Random Forest), Boosting (e.g., Gradient Boosting, XGBoost), and Stacking.
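
As a rough sketch, the three styles can be compared side by side with scikit-learn on a synthetic dataset (models and parameters are illustrative, not the actual coursework setup):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    # Bagging: many trees trained on bootstrap samples, predictions averaged
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    # Boosting: trees built sequentially, each correcting earlier errors
    "boosting": GradientBoostingClassifier(),
    # Stacking: a meta-model learns to combine base-model predictions
    "stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier()),
                    ("gb", GradientBoostingClassifier())],
        final_estimator=LogisticRegression()),
}
for name, model in models.items():
    print(name, model.fit(X_train, y_train).score(X_test, y_test))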

Code Implementation:

View the code on GitHub


Task 4 - Random Forest, GBM, XGBoost

Random Forest: An ensemble learning method that constructs multiple decision trees and combines their outputs to improve accuracy and reduce overfitting.

GBM (Gradient Boosting Machine): A boosting algorithm that builds trees sequentially, each correcting the errors of the previous ones by minimizing a loss function.

XGBoost (Extreme Gradient Boosting): An optimized version of GBM that uses regularization, parallel processing, and efficient handling of missing values for better speed and performance.
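
A minimal side-by-side sketch of the three models on synthetic data, assuming the xgboost package is installed (pip install xgboost):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "GBM": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1),
    "XGBoost": XGBClassifier(n_estimators=100, learning_rate=0.1,
                             eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))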

Code Implementation:

View the code on GitHub


Task 5 - Hyperparameter Tuning

Hyperparameter tuning is the process of optimizing model settings (hyperparameters) that are not learned from data, to improve a machine learning model's performance.

Types of Hyperparameter Tuning

  1. Grid Search: Exhaustively tests all possible combinations of hyperparameters.
  2. Random Search: Randomly selects hyperparameter values within a given range.
  3. Bayesian Optimization: Uses probabilistic models to efficiently search for the best hyperparameters.
  4. Genetic Algorithms: Evolves hyperparameters using mutation and crossover techniques.
  5. AutoML: Automatically optimizes hyperparameters using machine learning-based search methods.
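
A minimal sketch of the first two approaches with scikit-learn (the model and parameter grid are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=42)
params = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

# Grid search: tries every combination (3 x 3 = 9 candidates)
grid = GridSearchCV(RandomForestClassifier(), params, cv=5)
grid.fit(X, y)
print("Grid best:", grid.best_params_, grid.best_score_)

# Random search: samples a fixed number of combinations at random
rand = RandomizedSearchCV(RandomForestClassifier(), params,
                          n_iter=5, cv=5, random_state=42)
rand.fit(X, y)
print("Random best:", rand.best_params_, rand.best_score_)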

Code Implementation:

View the code on GitHub


Task 6 - Image Classification through K-Means Clustering

Image classification using K-Means clustering groups similar pixels or features into K clusters, enabling segmentation or categorization without labeled data. It is commonly used for color quantization and object segmentation.
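
A minimal color-quantization sketch with scikit-learn and Pillow: cluster the pixel colors into K groups and repaint each pixel with its cluster centre. The file name sample.jpg is a placeholder for any local image:

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("sample.jpg").convert("RGB"))
pixels = img.reshape(-1, 3)              # one row per pixel (R, G, B)

kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(pixels)
quantized = kmeans.cluster_centers_[kmeans.labels_]   # centre colour per pixel
quantized = quantized.reshape(img.shape).astype(np.uint8)

Image.fromarray(quantized).save("quantized.jpg")      # 8-colour version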

Code Implementation:

View the code on GitHub


Task 7 - Anomaly Detection

Anomaly detection identifies unusual or erroneous data points by flagging observations that deviate from the statistical pattern of the rest of the dataset. It can be performed using unsupervised or supervised learning methods.

Steps to Implement:

  1. Import libraries and load the dataset.
  2. Preprocess data (normalize, clean, etc.).
  3. Choose an anomaly detection algorithm.
  4. Train the model on the dataset.
  5. Evaluate results by identifying anomalies.
  6. Visualize anomalies using scatter plots or other graphs.
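
As one unsupervised choice for step 3, here is a minimal Isolation Forest sketch on synthetic 2-D data with a few injected outliers:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(0, 1, size=(300, 2))        # tight inlier cluster
outliers = rng.uniform(-6, 6, size=(10, 2))     # scattered anomalies
X = np.vstack([normal, outliers])

model = IsolationForest(contamination=0.05, random_state=42).fit(X)
labels = model.predict(X)          # +1 = normal, -1 = anomaly
print("Anomalies found:", (labels == -1).sum())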

Code Implementation:

View the code on GitHub


Task 8 - Generative AI using GAN

Generative AI using GANs (Generative Adversarial Networks) involves two neural networks—a generator that creates realistic data and a discriminator that distinguishes real from fake data—competing in a game-like setup to generate high-quality synthetic content, such as images, text, or audio.
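
A minimal PyTorch sketch of the adversarial setup, where a tiny generator learns to mimic samples from a 1-D Gaussian (the data, network sizes, and hyperparameters are illustrative, not the actual task):

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(2000):
    real = torch.randn(64, 1) * 1.25 + 4.0        # "real" samples ~ N(4, 1.25)
    fake = G(torch.randn(64, 8))                  # generator output from noise

    # Discriminator: label real as 1, fake as 0
    d_loss = loss_fn(D(real), torch.ones(64, 1)) + \
             loss_fn(D(fake.detach()), torch.zeros(64, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator: try to make the discriminator output 1 on fakes
    g_loss = loss_fn(D(fake), torch.ones(64, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

print("mean of generated samples:", G(torch.randn(1000, 8)).mean().item())

Note the fake.detach() in the discriminator loss: it stops the discriminator update from propagating gradients back into the generator, so the two networks are trained against each other rather than together.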

Code Implementation:

View the code on GitHub


Task 9 - PDF Query using LangChain

PDF query using LangChain is a process where text is extracted from a PDF, converted into embeddings, stored in a vector database, and then queried using natural language processing to retrieve relevant information efficiently.

Steps:

  1. Extract Text: Read and extract text from a PDF file using a library like PyMuPDF or pdfplumber.
  2. Split & Process Text: Divide extracted text into manageable chunks for better querying.
  3. Embed Text: Convert the text chunks into vector embeddings using an embedding model (e.g., OpenAI embeddings).
  4. Store in Vector Database: Save embeddings in a vector store (e.g., FAISS) for efficient retrieval.
  5. Query Processing: Accept user queries and convert them into vector representations.
  6. Retrieve Relevant Sections: Search the vector database for the most relevant text chunks.
  7. Generate Response: Use an LLM (such as an OpenAI GPT model) to generate human-like answers from the retrieved data.
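
A hedged sketch of this pipeline: LangChain's module layout changes frequently, so the imports below match one recent arrangement (langchain, langchain-community, langchain-openai, faiss-cpu, pypdf) and assume an OpenAI API key is set in the environment; document.pdf is a placeholder:

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

pages = PyPDFLoader("document.pdf").load()                 # 1. extract text
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100).split_documents(pages)   # 2. split

store = FAISS.from_documents(chunks, OpenAIEmbeddings())   # 3-4. embed + store

qa = RetrievalQA.from_chain_type(                          # 5-7. retrieve + answer
    llm=ChatOpenAI(), retriever=store.as_retriever())
print(qa.invoke("What is the main topic of the document?"))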

Code Implementation:

View the code on GitHub


Task 10 - Table Analysis using PaddleOCR

PaddleOCR is an open-source Optical Character Recognition (OCR) tool based on PaddlePaddle, designed for extracting text from images or scanned documents. It supports multiple languages, handles complex layouts, and provides high accuracy for text detection and recognition.
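
A minimal text-extraction sketch with the paddleocr package (pip install paddleocr paddlepaddle; table.png is a placeholder). PaddleOCR's PP-Structure module builds on this to recover full table structure:

from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")   # downloads models on first run
result = ocr.ocr("table.png", cls=True)

# Each detected line: bounding box plus (text, confidence)
for line in result[0]:
    box, (text, confidence) = line
    print(f"{confidence:.2f}  {text}")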

Code Implementation:

View the code on GitHub

