
COURSEWORK

Pallavi's AI-ML-001 course work. Lv 3

Author: Pallavi co
This Report is yet to be approved by a Coordinator.

LEVEL -03

30 / 3 / 2025


Task 1 - Decision Tree based ID3 Algorithm

The ID3 (Iterative Dichotomiser 3) algorithm is a popular decision tree classifier that builds trees using a top-down, greedy approach. At each node it selects the attribute with the highest information gain, that is, the largest reduction in entropy of the class labels, so the most informative attributes are tested first. This method is particularly useful for classification tasks, offering a clear and interpretable decision-making process.
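The attribute-selection step can be sketched in a few lines of plain Python. This is a minimal illustration, not a full tree builder; the function names and the toy weather data are made up for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """ID3's greedy choice: the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Toy data, invented for illustration only
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
chosen = best_attribute(rows, labels, ["outlook", "windy"])
```

Here "outlook" separates the two classes perfectly while "windy" tells us nothing, so the greedy step picks "outlook" as the root. A full ID3 implementation would then recurse on each branch with the remaining attributes.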

ID3 Algorithm


Task 2 - Naive Bayesian Classifier

The Naive Bayesian Classifier is a probabilistic model that uses Bayes' theorem to predict class labels from given features. It assumes that the features are conditionally independent of one another given the class. Because it is computationally efficient and often performs well even when that assumption does not strictly hold, it is widely applied in email filtering, document categorization, and other classification tasks.
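As a rough sketch of how this works for text, the code below counts words per class and scores a new message by adding the log prior to the log likelihood of each word. The function names and the tiny spam/ham dataset are assumptions for illustration, and Laplace (add-one) smoothing is one common choice rather than the only one.

```python
from collections import Counter, defaultdict
from math import log

def train(samples, labels):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for words, label in zip(samples, labels):
        word_counts[label].update(words)
    vocab = {w for words in samples for w in words}
    return class_counts, word_counts, vocab

def predict(model, words):
    """Return the class with the highest log-posterior (up to a constant)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = log(n_docs / total_docs)  # log prior
        for w in words:
            # Laplace (add-one) smoothing avoids zero probabilities
            score += log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data, invented for illustration only
samples = [["win", "money", "now"], ["cheap", "money"],
           ["meeting", "at", "noon"], ["lunch", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
model = train(samples, labels)
```

Summing per-word log probabilities is exactly where the independence assumption enters: each word contributes to the score without regard to the others.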

Naive Bayesian


Task 3 - Ensemble Techniques

Ensemble techniques in machine learning combine multiple models to improve overall predictive performance. Instead of relying on a single model, these methods aggregate the predictions of multiple models, leading to better accuracy, stability, and robustness.

Types of Ensemble Techniques

  1. Bagging (Bootstrap Aggregating) – Reduces variance by training multiple models on bootstrap samples (random samples drawn with replacement from the training set) and aggregating their predictions.
  2. Boosting – Improves weak models sequentially, correcting previous errors to reduce bias.
  3. Stacking – Combines predictions from multiple models using a meta-model for final output.
  4. Voting & Averaging – Aggregates predictions from different models using majority voting or averaging.
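Bagging and majority voting can be sketched together in plain Python. The weak learner below is a hypothetical one-dimensional threshold rule chosen only to keep the example short; any classifier could take its place.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Hypothetical weak learner: threshold halfway between the class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:  # degenerate sample containing a single class
        threshold = sum(x for x, _ in sample) / len(sample)
    else:
        threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: int(x > threshold)

def majority_vote(models, x):
    """Aggregate the individual predictions by taking the most common one."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data: class 0 for x in 0..9, class 1 for x in 10..19
rng = random.Random(0)
data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
```

Each stump sees a slightly different resampled dataset and so learns a slightly different threshold; the vote smooths out those differences, which is the variance reduction bagging aims for.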

Ensemble Techniques


Task 4 - Random Forest, GBM, and XGBoost

Random Forest

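Random Forest is an ensemble learning method that trains many decision trees, each on a bootstrap sample of the data and with a random subset of features considered at each split. For classification it outputs the class predicted by the most trees; for regression it outputs the mean of the trees' predictions. Compared with a single decision tree, it reduces overfitting and usually improves accuracy.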

Gradient Boosting Machines (GBM)

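GBM is a boosting algorithm that builds models sequentially, typically shallow decision trees, where each new model is fit to the errors (the negative gradient of the loss) of the ensemble built so far. It mainly reduces bias, while a small learning rate and limits on tree depth help keep variance in check, making it highly effective for structured data.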
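The sequential error-correcting idea can be sketched for regression with squared-error loss, where the negative gradient is simply the residual (up to a constant factor). This is a minimal sketch on made-up one-dimensional data, using single-split "stumps" as the weak learners; the names and default values are assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-split regressor on 1-D inputs (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit stumps one after another, each to the current residuals."""
    base = sum(ys) / len(ys)  # initial constant prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data, invented for illustration only
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
few = gradient_boost(xs, ys, n_rounds=1)
many = gradient_boost(xs, ys, n_rounds=50)
```

Comparing `mse(few, xs, ys)` with `mse(many, xs, ys)` shows the training error shrinking as more rounds are added. The learning rate scales each stump's contribution, trading slower fitting for a smoother model.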

XGBoost (Extreme Gradient Boosting)

XGBoost is an optimized implementation of gradient boosting designed for speed and efficiency. It adds L1 and L2 regularization on the leaf weights of its trees to limit overfitting and is widely used in machine learning competitions.



Task 5 - Hyperparameter Tuning

Hyperparameter tuning is the process of selecting the best hyperparameters for a machine learning model to improve its accuracy and generalization. Unlike model parameters, hyperparameters are set before training and impact the learning process.

Techniques for Hyperparameter Tuning

  • Grid Search: Evaluates every combination of values from a predefined grid of hyperparameters.
  • Random Search: Randomly selects combinations for evaluation.
  • Bayesian Optimization: Uses probabilistic models to find the best parameters.
  • Genetic Algorithms: Applies evolutionary strategies to optimize hyperparameters.
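The first two techniques can be sketched with the standard library alone. The `validation_score` function is a hypothetical stand-in for "train a model and score it on validation data", and the grid values are made up for the example.

```python
import itertools
import random

def validation_score(params):
    """Hypothetical objective, peaked at learning_rate=0.1 and max_depth=5."""
    return (-((params["learning_rate"] - 0.1) ** 2)
            - 0.01 * (params["max_depth"] - 5) ** 2)

grid = {"learning_rate": [0.01, 0.1, 1.0], "max_depth": [3, 5, 7]}

def grid_search(grid, score):
    """Evaluate every combination in the grid and keep the best one."""
    names = list(grid)
    candidates = [dict(zip(names, values))
                  for values in itertools.product(*grid.values())]
    return max(candidates, key=score), len(candidates)

def random_search(grid, score, n_trials, seed=0):
    """Evaluate only n_trials randomly drawn combinations."""
    rng = random.Random(seed)
    candidates = [{name: rng.choice(values) for name, values in grid.items()}
                  for _ in range(n_trials)]
    return max(candidates, key=score)

best_grid, n_evaluated = grid_search(grid, validation_score)
best_random = random_search(grid, validation_score, n_trials=4)
```

Grid search evaluates all 9 combinations here and is guaranteed to find the best point on the grid. Random search evaluates only 4, so it is cheaper but may miss that point; the trade-off becomes more favorable to random search as the number of hyperparameters grows.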

Hyperparameter Tuning


Task 6 - Image Classification using KMeans Clustering

  • KMeans Clustering is an unsupervised learning algorithm used for grouping similar data points into clusters based on feature similarity.
  • The algorithm works by:
    1. Choosing K cluster centroids randomly.
    2. Assigning each data point to the nearest centroid.
    3. Updating centroids by calculating the mean of assigned points.
    4. Repeating until centroids no longer change significantly.
  • Application in Image Classification:
    • It is used for image segmentation, color quantization, and pattern recognition.
    • Groups pixels with similar colors into clusters, so an image can be represented with only K colors, which reduces the amount of data later steps have to process.
    • Works best when K is appropriately chosen, but struggles with complex, high-dimensional images.
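The four steps listed above can be sketched in plain Python and applied to color quantization. The pixel values are invented for the example, and the function names and defaults are assumptions; a real image would supply thousands of RGB tuples instead of six.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iters=20, seed=0):
    """Plain k-means on tuples of numbers; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random initial centroids
    assignments = []
    for _ in range(n_iters):
        # step 2: assign each point to its nearest centroid
        assignments = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # step 3: move each centroid to the mean of its assigned points
        new_centroids = []
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # step 4: stop when nothing moves
            break
        centroids = new_centroids
    return centroids, assignments

# Toy "image": three reddish and three bluish RGB pixels
pixels = [(250, 10, 10), (240, 20, 20), (255, 0, 0),
          (10, 10, 250), (20, 20, 240), (0, 0, 255)]
centroids, assignments = kmeans(pixels, k=2)
quantized = [centroids[a] for a in assignments]  # each pixel replaced by its cluster color
```

After clustering, `quantized` holds only two distinct colors, one average red and one average blue. Note that this groups pixels without using any labels; turning clusters into class predictions requires an extra step, such as mapping each cluster to its most common known label.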

KMeans Clustering


Task 7 - Anomaly Detection

Anomaly detection identifies patterns in data that deviate from normal behavior, which is crucial for fraud detection, cybersecurity, and system monitoring. Anomalies fall into three types: point anomalies (single unusual data points), contextual anomalies (points that are unusual only in a particular context), and collective anomalies (groups of points that are unusual together). Detection methods include statistical approaches (Z-score, IQR), machine learning (KMeans, Isolation Forest), and deep learning (Autoencoders).
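The two statistical approaches can be sketched with Python's `statistics` module. The thresholds (3 standard deviations, 1.5 times the IQR) are common conventions rather than fixed rules, and the sensor readings are made up for the example.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if stdev and abs(v - mean) / stdev > threshold]

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

# Toy readings with one obvious point anomaly (95)
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11] * 2 + [95]
z_flags = zscore_outliers(readings)
iqr_flags = iqr_outliers(readings)
```

Both methods flag the reading of 95 here. The Z-score method is sensitive to the outlier itself, since an extreme value inflates the mean and standard deviation, so on very small samples it can fail to flag anything; the IQR method is more robust in that respect.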

Anomaly Detection


Task 8 - Generative AI Using GANs

Generative Adversarial Networks (GANs) consist of a generator (creates fake images) and a discriminator (detects fake images), competing to improve realism. GANs are widely used in image synthesis, style transfer, and data augmentation. Training involves datasets like CIFAR-10 and CelebA, and implementations can be done using PyTorch or TensorFlow. The goal is to generate high-quality, diverse synthetic images.
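The competition between the two networks is usually expressed through their loss functions. The sketch below computes the standard binary cross-entropy losses for a single real/fake pair, where `d_real` and `d_fake` are assumed to be the discriminator's output probabilities (strictly between 0 and 1) for a real image and a generated image. It uses the common non-saturating form of the generator loss and leaves out the networks and training loop, which would need PyTorch or TensorFlow.

```python
from math import log

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real images should score 1, generated images 0."""
    return -(log(d_real) + log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: push the score on fakes towards 1."""
    return -log(d_fake)

# A confident, correct discriminator has a low loss ...
strong_d = discriminator_loss(d_real=0.9, d_fake=0.1)
# ... while an unsure one has a higher loss
weak_d = discriminator_loss(d_real=0.6, d_fake=0.4)

# The generator's loss falls as its fakes fool the discriminator more often
fooled = generator_loss(d_fake=0.9)
caught = generator_loss(d_fake=0.1)
```

The two objectives pull in opposite directions: a fake that scores 0.9 is good for the generator and bad for the discriminator. Training alternates between updating each network against its own loss.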

GAN


Task 9 - PDF Query Using LangChain

This task uses LangChain, a framework for building applications on top of large language models, to extract relevant information from PDF documents based on user queries. The system parses each PDF, extracts its text, interprets the user's query, and retrieves the excerpts most relevant to it. Implementation can be done using Python libraries like LangChain and PyPDF. The goal is to develop an efficient document querying system for structured information retrieval.
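The retrieval step can be illustrated without any external library. The sketch below does not call LangChain or PyPDF: it assumes the PDF text has already been extracted into a string, splits it into overlapping chunks, and ranks chunks by simple word overlap with the query. A LangChain pipeline would typically replace the word-overlap score with embedding similarity; all names and the sample text here are hypothetical.

```python
import re

def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping windows of words."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def tokenize(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(chunks, query, top_k=1):
    """Rank chunks by how many query words they share (a crude relevance score)."""
    query_tokens = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_tokens & tokenize(c)),
                    reverse=True)
    return ranked[:top_k]

# Stand-in for text extracted from a PDF (invented for illustration)
text = ("Invoices are due within thirty days. Late payments incur a fee of "
        "two percent. Refunds are processed within ten business days of the request.")
chunks = chunk_text(text, chunk_size=8, overlap=2)
answer = retrieve(chunks, "How are refunds processed?")
```

Overlapping chunks reduce the chance that a relevant sentence is cut in half at a chunk boundary. The retrieved excerpt would then be passed to a language model together with the query to produce the final answer.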

PDF Query


Task 10 - Table Analysis Using PaddleOCR

This task uses PaddleOCR, an Optical Character Recognition (OCR) toolkit, to extract and analyze tabular data from images or scanned documents. The pipeline detects tables, extracts their cell contents, and then performs statistical computations or data visualization using Python libraries like Pandas and Matplotlib. The goal is to extract structured data from images accurately enough to support reliable analysis.
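The step between OCR and analysis, turning recognized text boxes back into rows and columns, can be sketched in plain Python. The input format below, a list of `(text, x_center, y_center)` tuples, is a hypothetical simplification and not PaddleOCR's actual output structure; the cell values are invented, and Pandas would normally take over once the table is rebuilt.

```python
def rebuild_table(cells, row_tolerance=10):
    """Group recognised cells into rows by vertical position, then sort by x."""
    rows = []
    for text, x, y in sorted(cells, key=lambda c: c[2]):
        if rows and abs(rows[-1]["y"] - y) <= row_tolerance:
            rows[-1]["cells"].append((x, text))  # same row as the previous cell
        else:
            rows.append({"y": y, "cells": [(x, text)]})  # start a new row
    return [[text for _, text in sorted(row["cells"])] for row in rows]

# Hypothetical OCR output: (text, x_center, y_center)
cells = [
    ("Item", 50, 20), ("Price", 200, 22),
    ("Pen", 52, 60), ("1.50", 198, 61),
    ("Book", 49, 100), ("12.00", 201, 99),
]
table = rebuild_table(cells)
header, body = table[0], table[1:]

# Simple analysis step: mean of the "Price" column
prices = [float(row[header.index("Price")]) for row in body]
mean_price = sum(prices) / len(prices)
```

The `row_tolerance` parameter absorbs the small vertical jitter that OCR boxes on the same printed line usually show. Real scans may also need handling for merged cells, skewed pages, and misread digits before the numbers can be trusted.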

Table Analysis Using PaddleOCR

UVCE,
K. R Circle,
Bengaluru 01