
BLOG · 28/9/2024

Level 3 Final report - Hema Shenoy


Level 3 Task Report

Task 1: Decision Tree based ID3 Algorithm

Understanding Basic Terminology

The ID3 (Iterative Dichotomiser 3) algorithm is a decision tree algorithm used for classification tasks. It employs a top-down, greedy approach to create a tree structure by choosing the attribute that maximizes information gain at each node.

Key Concepts:

  • Entropy: A measure of impurity or disorder in a dataset.
  • Information Gain: The reduction in entropy after a dataset is split on an attribute.

Implementation:

The implementation of the ID3 algorithm involves:

  1. Calculating entropy for the dataset.
  2. Determining the best attribute based on information gain.
  3. Recursively building the tree until all data points are classified.
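As a rough illustration of steps 1 and 2 (a minimal stand-alone sketch, not the code in the linked repository), the two quantities can be computed like this:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Reduction in entropy obtained by splitting on one attribute."""
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[attr_index], []).append(label)
    # Weighted entropy of the subsets after the split
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in splits.values())
    return entropy(labels) - remainder
```

ID3 evaluates `information_gain` for every attribute at a node, splits on the winner, and recurses on each subset.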

GitHub Link: ID3 Implementation


Task 2: Naive Bayesian Classifier

Understanding Naive Bayesian Classifier

The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem, assuming independence between features. It is particularly useful for text classification tasks.

Key Concepts:

  • Prior Probability: The probability of a class before any features are observed.
  • Likelihood: The probability of the features given a class.
  • Posterior Probability: The updated probability of a class after considering the features.

Implementation:

The classifier can be implemented using the following steps:

  1. Calculate prior probabilities for each class.
  2. Compute likelihoods for each feature.
  3. Use Bayes' theorem to predict the class for new data.
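The three steps above can be sketched for categorical features in plain Python (a simplified illustration with Laplace smoothing, not the linked implementation):

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies (priors) and per-feature value counts
    (likelihoods) from the training data."""
    priors = Counter(labels)
    counts = defaultdict(Counter)  # (feature_index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            counts[(i, y)][v] += 1
    return priors, counts, len(labels)

def predict_nb(model, row):
    """Pick the class maximizing log prior + sum of log likelihoods,
    with add-one (Laplace) smoothing for unseen feature values."""
    priors, counts, n = model
    best, best_score = None, float('-inf')
    for y, class_count in priors.items():
        score = math.log(class_count / n)
        for i, v in enumerate(row):
            c = counts[(i, y)]
            score += math.log((c[v] + 1) / (sum(c.values()) + len(c) + 1))
        if score > best_score:
            best, best_score = y, score
    return best
```

Working in log space avoids underflow when many small probabilities are multiplied.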

GitHub Link: Naive Bayesian Classifier Implementation


Task 3: Ensemble Techniques

What are Ensemble Techniques?

Ensemble techniques combine multiple models to improve the overall performance of predictions. They leverage the strengths of individual models to reduce errors.

Key Techniques:

  • Bagging: A method that creates multiple subsets of the data and trains models on them (e.g., Random Forest).
  • Boosting: Sequentially applies weak models, each correcting the errors of the previous ones (e.g., AdaBoost).

Application on Titanic Dataset:

Ensemble techniques were applied to enhance prediction accuracy regarding passenger survival, utilizing models such as Random Forest and Gradient Boosting.
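The bagging idea described above can be illustrated with a minimal sketch (one-rule "stump" learners and bootstrap resampling; a toy stand-in for the Random Forest used in the actual notebook):

```python
import random
from collections import Counter

def train_stump(rows, labels):
    """One-rule stump: pick the feature whose value -> majority-label
    lookup table best fits the training sample."""
    best = None
    for i in range(len(rows[0])):
        table = {}
        for row, y in zip(rows, labels):
            table.setdefault(row[i], Counter())[y] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in table.items()}
        acc = sum(rule[row[i]] == y for row, y in zip(rows, labels))
        if best is None or acc > best[0]:
            best = (acc, i, rule)
    return best[1], best[2]

def stump_predict(stump, row, default=None):
    i, rule = stump
    return rule.get(row[i], default)

def bagging(rows, labels, n_models=25, seed=0):
    """Train one stump per bootstrap resample; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(rows)) for _ in rows]
        models.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx]))
    def predict(row):
        votes = Counter(stump_predict(m, row) for m in models)
        return votes.most_common(1)[0][0]
    return predict
```

Because each model sees a different resample, their individual errors tend to cancel out in the vote, which is the core reason bagging reduces variance.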

GitHub Link: Ensemble Techniques Implementation


Task 4: Random Forest, GBM, and XGBoost

Random Forest

A powerful ensemble method that creates a multitude of decision trees and merges their results to improve accuracy and control overfitting.

Gradient Boosting Machines (GBM)

An ensemble technique that builds models sequentially, with each new model attempting to correct errors made by the previous ones.

XGBoost

An optimized implementation of gradient boosting that trains faster and often generalizes better, using techniques such as regularization and tree pruning to prevent overfitting.
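The sequential error-correcting idea behind GBM (which XGBoost builds on) can be shown with a minimal sketch for squared-error regression on 1-D inputs, using threshold stumps as the weak learners (a toy illustration, not the library code):

```python
def fit_stump(xs, residuals):
    """Find the threshold split minimizing squared error on the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left) if left else 0.0
        right_mean = sum(right) / len(right) if right else 0.0
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, (t, left_mean, right_mean))
    return best[1]

def stump_value(stump, x):
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean

def gradient_boost(xs, ys, n_rounds=50, lr=0.1):
    """Each round fits a stump to the current residuals (the negative
    gradient of squared error) and adds a damped copy to the ensemble."""
    base = sum(ys) / len(ys)
    pred = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        s = fit_stump(xs, residuals)
        stumps.append(s)
        pred = [p + lr * stump_value(s, x) for p, x in zip(pred, xs)]
    return lambda x: base + lr * sum(stump_value(s, x) for s in stumps)
```

The learning rate `lr` damps each stump's contribution, trading more rounds for better generalization, which is the same shrinkage idea that GBM and XGBoost expose as a hyperparameter.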

GitHub Link: Random Forest, GBM, and XGBoost Implementation


Task 5: Hyperparameter Tuning

Understanding Hyperparameter Tuning

Hyperparameter tuning involves searching over a model's configuration settings (values fixed before training, such as tree depth or learning rate) to optimize performance. The choice of hyperparameters can significantly impact the accuracy of the model.

Techniques:

  • Grid Search: Exhaustively searching through a specified subset of hyperparameters.
  • Random Search: Randomly selecting hyperparameters from a defined distribution.

Implementation:

Using the Titanic dataset, the model's hyperparameters were tuned to improve its predictive performance.
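Grid search as described above amounts to evaluating every combination in the grid and keeping the best; a minimal framework-free sketch (the actual notebook would typically use a library helper such as scikit-learn's `GridSearchCV`):

```python
from itertools import product

def grid_search(evaluate, grid):
    """Try every combination of hyperparameter values and keep the best.
    `evaluate(params)` trains a model and returns a validation score
    (higher is better)."""
    best_params, best_score = None, float('-inf')
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Random search replaces the exhaustive `product` loop with a fixed number of draws from each parameter's distribution, which scales better when the grid is large.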

GitHub Link: Hyperparameter Tuning Implementation


Task 6: Image Classification using KMeans Clustering

Understanding KMeans Clustering

KMeans clustering is an unsupervised learning algorithm that partitions data into ‘k’ clusters based on feature similarity. Each cluster's centroid is recomputed as the mean of the points assigned to it, and points are reassigned to their nearest centroid until the assignments stabilize.

Implementation:

Using the MNIST dataset, KMeans was employed to group images into clusters based on pixel intensity, with each cluster then mapped to a digit class.
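The assign-then-update loop (Lloyd's algorithm) can be sketched in plain Python on small numeric points (a toy illustration; the MNIST notebook would operate on flattened pixel vectors):

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from random data points
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster emptied
        centroids = [tuple(sum(coord) / len(c) for coord in zip(*c))
                     if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters
```

Because the result depends on the random initialization, KMeans is usually run several times (or seeded with k-means++) and the lowest-inertia run is kept.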

GitHub Link: KMeans Clustering Implementation


Task 7: Anomaly Detection

Understanding Anomaly Detection

Anomaly detection identifies rare items, events, or observations that raise suspicions by differing significantly from the majority of the data.

Methods:

  • Statistical Methods: Utilize statistical tests to detect outliers.
  • Machine Learning Approaches: Implement supervised or unsupervised learning methods to identify anomalies.
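The simplest statistical method mentioned above is the z-score rule: flag any value that lies more than a few standard deviations from the mean (a minimal sketch, not the linked implementation):

```python
import math

def zscore_outliers(values, threshold=3.0):
    """Return the values whose |z-score| exceeds the threshold."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [v for v in values if std and abs(v - mean) / std > threshold]
```

This works well for roughly normal data; for multimodal or high-dimensional data, model-based approaches such as Isolation Forest or autoencoders are more robust.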

GitHub Link: Anomaly Detection Implementation


Task 8: Generative AI Using GAN

Overview

A generative adversarial network (GAN) was developed to generate realistic images of animals. The architecture was customized, and the GAN was trained on a chosen dataset to produce high-quality synthetic images.

Implementation:

  • Developed a GAN model to generate animal images.

Repository Link: GAN implementation


Task 9: PDF Query Using LangChain

Overview

Utilize LangChain, a framework for building applications on top of large language models, to extract relevant information from PDF documents based on user queries.

Implementation Steps:

  1. Query Interpretation: Develop a system that can interpret user queries.
  2. PDF Processing: Process PDF documents and retrieve relevant sections or excerpts using language understanding techniques.

Outcomes:

  • Development of a PDF query system using LangChain.
  • Implementation of PDF parsing and text extraction functionality.
  • Integration of natural language processing techniques for query interpretation.
  • Testing and validation of the system with various PDF documents and queries.
  • Documentation of system architecture, functionality, and usage guidelines.
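The core retrieval step behind such a system — split the document into chunks, score each chunk against the query, return the best matches — can be illustrated framework-free. This is a deliberately crude sketch (keyword overlap standing in for the embedding similarity a LangChain pipeline would use), not LangChain's actual API:

```python
def chunk_text(text, size=80, overlap=20):
    """Split a document into overlapping character windows,
    as a text splitter would."""
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]

def relevance(chunk, query):
    """Crude keyword-overlap score (a stand-in for embedding similarity)."""
    query_words = set(query.lower().split())
    return sum(word in query_words for word in chunk.lower().split())

def retrieve(chunks, query, top_k=1):
    """Return the top_k chunks most relevant to the query."""
    return sorted(chunks, key=lambda c: relevance(c, query),
                  reverse=True)[:top_k]
```

In the real pipeline, the retrieved chunks are passed to a language model along with the query so it can compose an answer grounded in the document.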

GitHub Link: PDF Query Implementation


Task 10: Table Analysis Using PaddleOCR

Overview

Employ PaddleOCR, an Optical Character Recognition (OCR) toolkit, to extract and analyze tabular data from images or scanned documents.

Implementation Steps:

  1. Table Detection: Develop a pipeline that can accurately detect tables and extract data.
  2. Data Analysis: Perform analysis such as statistical computations or data visualization.
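Once PaddleOCR has turned a table image into text lines, step 2 reduces to parsing cells and computing statistics; a minimal sketch of that downstream analysis (assuming whitespace-separated OCR output, which real scans often violate):

```python
def parse_table(lines):
    """Split whitespace-separated OCR output lines into cells;
    the first line is taken as the header row."""
    rows = [line.split() for line in lines]
    return rows[0], rows[1:]

def column_stats(header, body, col_name):
    """Min, max, and mean of one numeric column."""
    i = header.index(col_name)
    values = [float(row[i]) for row in body]
    return {'min': min(values), 'max': max(values),
            'mean': sum(values) / len(values)}
```

In practice the hard part is the detection step: PaddleOCR's table-structure output must be aligned into rows and columns before any such per-column analysis is possible.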

Outcomes:

  • Implementation of a table detection and extraction pipeline using PaddleOCR.
  • Development of algorithms for tabular data analysis, including statistical computations.
  • Integration of data visualization techniques to represent extracted data.
  • Evaluation of pipeline accuracy and performance on various image datasets.
  • Documentation of the process, including code, methodologies, and results.

GitHub Link: Table Analysis Implementation

