Plotly – It is a powerful Python library for data visualization, and far more interactive than Matplotlib or Seaborn. I explored many of its chart types, such as scatter, line, bar, 3D, and choropleth (a thematic map), on the gapminder dataset.
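For instance, here is a minimal sketch of the kind of Plotly Express calls I tried; the column names below are the ones in the built-in px.data.gapminder() dataset.

```python
import plotly.express as px

df = px.data.gapminder()

# Animated scatter: GDP per capita vs life expectancy, one frame per year
fig = px.scatter(df, x="gdpPercap", y="lifeExp", size="pop", color="continent",
                 hover_name="country", animation_frame="year", log_x=True)
fig.show()

# Choropleth: a thematic world map coloured by life expectancy
fig = px.choropleth(df, locations="iso_alpha", color="lifeExp",
                    hover_name="country", animation_frame="year")
fig.show()
```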
Decision trees- They are supervised learning algorithms for regression or classification tasks.
Classification- Simply put, a decision tree is a series of true/false (if-else) conditions that classify the data. Using these conditions we keep splitting the nodes: first from the root node into impure child nodes, and further until we end up with pure nodes, also called leaf nodes. It must be noted that not every way of splitting yields pure leaf nodes at the end. So how does the model learn which split is best?
There are several methods to measure this, including Gini impurity and information gain.
Gini impurity- The Gini impurity of a node is given by

$$\text{Gini} = 1 - \sum_{i=1}^{c} p_i^2$$

where c is the number of classes and $p_i$ is the fraction of samples in class i. Once the impurity is found for both the left and right child of a split, their weighted average is taken; this weighted average is computed for every candidate split in order to choose the best one. The lower the impurity, the better the classification.
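To make this concrete, here is a small sketch of Gini impurity and the weighted average of a split in plain Python (the example labels are made up for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of p_i^2 over the c classes."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Weighted average of the children's impurities for one split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

left = ["yes", "yes", "yes", "no"]   # left child after the split
right = ["no", "no", "no", "yes"]    # right child after the split
print(weighted_gini(left, right))    # 0.375 -- lower is better
```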
Entropy- The entropy of a node is calculated by

$$\text{Entropy} = -\sum_{i=1}^{c} p_i \log_2 p_i$$

and the information gain of a split is the parent's entropy minus the weighted entropy of its children,

$$\text{IG} = \text{Entropy}(\text{parent}) - \sum_i w_i \, \text{Entropy}(\text{child}_i)$$

where $w_i$ is the fraction of samples that goes to child i. The higher the information gain, the better the split.
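A matching sketch for entropy and information gain, using log base 2 as in the formulas above:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy: -sum of p_i * log2(p_i) over the c classes."""
    n = len(labels)
    return -sum(k / n * math.log2(k / n) for k in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the weighted entropy of the children;
    the weights w_i are the fractions of samples in each child."""
    n = len(parent)
    return entropy(parent) - len(left) / n * entropy(left) \
                           - len(right) / n * entropy(right)

parent = ["yes"] * 4 + ["no"] * 4
left, right = ["yes", "yes", "yes", "no"], ["no", "no", "no", "yes"]
print(information_gain(parent, left, right))  # ~0.19 -- higher is better
```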
Regression- Here the split is done not to classify the data but to predict continuous values. Each leaf node predicts the mean of the target variable for the samples that reach it, and the best split is the one with the minimum mean squared error (MSE).
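Here is a minimal sketch of a regression tree in scikit-learn; note that the criterion name "squared_error" assumes a recent scikit-learn version.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)      # one feature
y = np.sin(X).ravel() + 0.1 * rng.randn(80)   # noisy continuous target

# Each leaf predicts the mean target of its samples; splits minimise MSE
reg = DecisionTreeRegressor(max_depth=3, criterion="squared_error")
reg.fit(X, y)
print(reg.predict([[2.5]]))  # the mean of the leaf that x = 2.5 falls into
```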
SVM- It is a machine learning algorithm used mainly for classification tasks. It works by finding the hyperplane that best separates the data points of different classes.
Margin- It is the shortest distance between an observation and the threshold, where the threshold is the separating line or plane (the hyperplane). The data points closest to the hyperplane are called support vectors. The larger the margin, the better the generalization (the better the performance) on unseen data.
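As a quick illustration, here is a sketch of a (nearly) hard-margin linear SVM on toy blobs, printing its support vectors and margin width (2 / ||w|| for a linear kernel):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=6)
clf = SVC(kernel="linear", C=1000)  # a large C leaves little room for error
clf.fit(X, y)

print(clf.support_vectors_)           # the points closest to the hyperplane
print(2 / np.linalg.norm(clf.coef_))  # the margin width
```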
Soft margin- Sometimes misclassification of some data points with respect to the hyperplane is allowed so that the model performs better overall. Cross-validation is used to determine how many misclassified observations to allow inside the soft margin so that the model generalizes best; in practice this tolerance is controlled by a penalty parameter (C in scikit-learn).
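For example, a sketch of picking that tolerance by cross-validation, where scikit-learn's penalty parameter C plays this role (a small C tolerates more misclassifications, giving a softer margin):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
search = GridSearchCV(SVC(kernel="linear"),
                      {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print(search.best_params_)  # the C that generalised best across the folds
```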
The kernel trick- When the data is not linearly separable, SVM implicitly transforms the data to a higher dimension where a linear separator can exist. Examples include the polynomial kernel, the radial (RBF) kernel, and the linear kernel.
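A sketch of the kernel trick on concentric circles, which no straight line can separate; the RBF (radial) kernel handles them easily:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
for kernel in ["linear", "poly", "rbf"]:
    score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(kernel, round(score, 3))  # rbf should score far above linear
```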
Down below, I have used the concept of Support Vector Machines to detect the possibility of breast cancer.
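Here is a minimal sketch of such a pipeline, using scikit-learn's built-in breast cancer dataset; the exact preprocessing and kernel in my notebook may differ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scaling matters for SVMs, since the margin depends on distances
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # typically > 0.95
```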