
COURSEWORK

Vibha's AI-ML-001 Coursework, Level 2


Level-2 Report

4 / 1 / 2025


Level-2 Tasks

Task 1: ID3 Decision Tree

I used Kaggle (as usual) to implement this decision tree with the ID3 algorithm, which relies on two basic functions to decide which feature to split on at each node (the branching/rooting), and therefore what the final prediction will be. These functions are entropy (the randomness in the values/answers) and information gain (the amount of entropy reduced from one level to the next). They are defined by:
Information gain:
IG(S, A) = H(S) − Σ_v (|S_v| / |S|) · H(S_v)
Entropy:
H(S) = − Σ_i p_i · log2(p_i), where p_i is the proportion of class i in S
Code:
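The notebook itself isn't embedded here, so below is a minimal standalone sketch of how the ID3 split selection could look in Python, using the entropy and information-gain definitions above. The toy "play tennis"-style data and the helper names (entropy, information_gain, id3, predict) are illustrative assumptions, not the exact Kaggle implementation.

```python
import numpy as np
import pandas as pd
from collections import Counter

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the classes present in S
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * np.log2(c / total) for c in counts.values())

def information_gain(df, feature, target):
    # IG(S, A) = H(S) - sum(|S_v| / |S| * H(S_v)) over the values v of feature A
    base = entropy(df[target])
    weighted = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return base - weighted

def id3(df, target, features):
    labels = df[target]
    # Stop when the node is pure or no features remain; return the majority class
    if labels.nunique() == 1 or not features:
        return labels.mode()[0]
    # Split on the feature with the highest information gain
    best = max(features, key=lambda f: information_gain(df, f, target))
    remaining = [f for f in features if f != best]
    return {best: {value: id3(subset, target, remaining)
                   for value, subset in df.groupby(best)}}

def predict(tree, row):
    # Walk the nested dict until a leaf (class label) is reached
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature].get(row[feature])
        if tree is None:
            return None  # feature value not seen during training
    return tree

# Toy "play tennis"-style data, just to exercise the functions
data = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"],
    "Windy":   ["No", "Yes", "No", "No", "Yes", "Yes"],
    "Play":    ["No", "No", "Yes", "Yes", "No", "Yes"],
})
tree = id3(data, target="Play", features=["Outlook", "Windy"])
print(tree)
print(predict(tree, {"Outlook": "Rain", "Windy": "No"}))  # -> "Yes"
```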


Task 3: Ensemble Techniques

An ensemble simply means a collection or an arrangement, in this case of various ML models (these could be regression models, decision trees, or other classifiers). There are 3 main ways to combine/arrange these models in order to get accurate outputs. These are:

  1. Bagging (Bootstrap Aggregating): A number of models are trained in parallel, each on a different chunk of the dataset sampled with replacement (so any data point may or may not be picked again), and the results are aggregated by voting (for classification) or averaging (for regression) to arrive at a solution.
  2. Boosting: While bagging has its models working in parallel, boosting is a sequential technique. It starts with a weak model, and the information from this is fed to another model, which then corrects the previous one's errors. This repeats for a number of models until an appropriately accurate solution is obtained, for example by taking a weighted sum of all the outcomes, or whatever method suits best.
  3. Stacking: Here, we use various models to predict a certain output from a dataset. Then we combine all of these predictions using another model (a meta-model), to make one final prediction that leverages the strengths of all the models we used.
    The last two techniques, however accurate, are more prone to overfitting. So we need to look at our specific needs and then decide which technique to use. Below is the infamous Titanic dataset, on which we've used these three methods to make predictions.

Code:
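The notebook isn't reproduced here, so here's a rough scikit-learn sketch of the three techniques on the Titanic data. It loads seaborn's built-in copy of the dataset for convenience (the task used the Kaggle CSV); the chosen features and model parameters are illustrative assumptions rather than the exact setup from the notebook.

```python
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.metrics import accuracy_score

# Seaborn's copy of the Titanic data (same idea as the Kaggle CSV)
df = sns.load_dataset("titanic")

# Minimal preprocessing: a few features, encode sex, drop rows with missing values
df = df[["survived", "pclass", "sex", "age", "fare"]].dropna()
df["sex"] = (df["sex"] == "male").astype(int)
X, y = df.drop(columns="survived"), df["survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    # Bagging: many trees (the default base estimator) trained on bootstrap
    # samples in parallel, combined by majority vote
    "Bagging": BaggingClassifier(n_estimators=50, random_state=42),
    # Boosting: weak learners trained sequentially, each focusing on the
    # previous one's mistakes, combined by a weighted sum
    "Boosting": AdaBoostClassifier(n_estimators=100, random_state=42),
    # Stacking: different base models whose predictions are combined by a
    # final meta-model
    "Stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(max_depth=3)),
            ("logreg", LogisticRegression(max_iter=1000)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: accuracy = {accuracy_score(y_test, model.predict(X_test)):.3f}")
```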


Task 7: Anomaly Detection

This task was actually quite interesting. We essentially created a random 2D dataset, with around 200 values belonging to two classes, scattered more or less randomly. Then, we set a decision boundary in the form of a threshold value. A threshold on what? Anomaly scores! These are scores assigned to each point in the dataset, which measure how "normal" it is with respect to all the other points in that dataset. If the score of a certain point is beyond the threshold, it counts as an outlier (and conversely, an inlier if it isn't). This was fun!
Also, for curiosity's sake, if we want to calculate this stuff manually, our anomaly score could be the Euclidean distance between our target point and some reference. Our threshold can be whatever we like, but would usually be around the 80th-90th percentile of the scores.

Code:
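Again, the notebook isn't embedded here, so below is a minimal sketch of the manual approach described above: roughly 200 random 2D points, the Euclidean distance to a reference point (the overall centroid, an assumption made here) as the anomaly score, and the 90th percentile of the scores as the threshold. The cluster positions and the exact percentile are illustrative, not the values from the original notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# ~200 random 2D points: two loose clusters plus a handful of scattered points
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.7, size=(90, 2))
cluster_b = rng.normal(loc=[4.0, 4.0], scale=0.7, size=(90, 2))
scattered = rng.uniform(low=-4.0, high=8.0, size=(20, 2))
points = np.vstack([cluster_a, cluster_b, scattered])

# Anomaly score: Euclidean distance from each point to a reference
# (here the overall centroid is used as the reference point)
centroid = points.mean(axis=0)
scores = np.linalg.norm(points - centroid, axis=1)

# Decision boundary: the 90th percentile of the anomaly scores
threshold = np.percentile(scores, 90)
is_outlier = scores > threshold
print(f"threshold = {threshold:.3f}, flagged {is_outlier.sum()} points as outliers")

# Plot inliers vs. flagged outliers
plt.scatter(points[~is_outlier, 0], points[~is_outlier, 1], s=15, label="inliers")
plt.scatter(points[is_outlier, 0], points[is_outlier, 1], s=25, c="red", label="outliers")
plt.legend()
plt.title("Distance-to-centroid anomaly scores, 90th percentile threshold")
plt.show()
```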


Task ?: ???
