Decision Trees

Mari Sakamoto
6 min read · Mar 5, 2022

Applied to Kaggle Titanic Challenge with R


What is a Decision Tree?

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

In machine learning, we call a model supervised when there is a target variable. Decision trees can model both continuous and discrete target variables. The algorithm aims to classify the target variable as well as possible by segmenting the data using the relevant variables.

Basic Algorithm:

  1. Seek the best binary split rule for each variable.
  2. Apply the best split found among all variables.
  3. Recursively, for each leaf, repeat steps 1 and 2 until a stopping rule is reached.

To qualify the best split, the algorithm can use two different criteria that measure how well a split separates the classes: the Gini index or the Shannon entropy.

Gini Index (Ig)

The Gini index Ig for each leaf J in the tree is calculated by summing the squared probability pᵢ of each possible answer (in a binary tree there are only two possible answers) and subtracting the result from 1.

Ig = 1 − Σᵢ pᵢ²

In a binary problem, this index varies from 0 to 0.5. We want to avoid the maximum value, which represents the most impure node possible (a 50/50 split between the classes).

Also, the Gini index is the default in R because it is less computationally demanding.
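
As a worked example, here is a minimal base-R sketch (the helper names gini and split_gini are mine, not from the article) that computes the Gini index of a node and the weighted impurity of a candidate binary split, which is the quantity the algorithm minimises in steps 1 and 2:

```r
# Gini index of a node, given the vector of class labels it contains
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Weighted Gini impurity of a candidate binary split: the tree keeps
# the split that minimises this value across all variables
split_gini <- function(y_left, y_right) {
  n <- length(y_left) + length(y_right)
  (length(y_left) / n) * gini(y_left) + (length(y_right) / n) * gini(y_right)
}

gini(c(0, 0, 1, 1))           # 0.5: maximally impure binary node
split_gini(c(0, 0), c(1, 1))  # 0: a perfect split
```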

Shannon Entropy (H)

The Shannon entropy follows a logic very similar to the Gini index and is calculated as follows:

H = − Σᵢ pᵢ log₂(pᵢ)

It varies from 0 to 1: the entropy is zero when the probability is 0 or 100% (a pure leaf), and maximal when the classes are evenly split.

Gini x Entropy
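
The comparison above can be reproduced with a small base-R sketch, assuming a binary target and plotting both impurity measures as a function of the class probability p:

```r
# Impurity of a binary node as a function of the class probability p
p <- seq(0.001, 0.999, by = 0.001)
gini_curve    <- 1 - p^2 - (1 - p)^2
entropy_curve <- -p * log2(p) - (1 - p) * log2(1 - p)

plot(p, entropy_curve, type = "l", xlab = "p", ylab = "Impurity")
lines(p, gini_curve, lty = 2)
legend("topright", legend = c("Entropy", "Gini"), lty = c(1, 2))
```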

Stopping Rules

Hyper-parameters are the stopping rules that tell the decision tree when to cease its iterations (see the sketch after this list). They include:

  • Minimum number of observations by leaf
  • Maximum depth
  • Complexity parameter (cp): the higher it is set, the lower the complexity of the resulting tree
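
In R's rpart package, these stopping rules map onto arguments of rpart.control; here is a minimal sketch with illustrative values (not the ones used in the article):

```r
library(rpart)

# Stopping rules for the tree; later passed as rpart(..., control = ctrl)
ctrl <- rpart.control(minbucket = 10,   # minimum number of observations per leaf
                      maxdepth  = 5,    # maximum depth of the tree
                      cp        = 0.01) # complexity parameter: higher cp, simpler tree
```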
Creating a decision tree in R

Hands-On: Kaggle’s Machine Learning from Disaster

1. The challenge

The challenge consists of predicting which passengers survived the sinking of the Titanic. Kaggle provides a training dataset containing 891 passengers, their outcome (whether they survived or not) and some information about them, such as:

  • Pclass: passenger class
  • Age
  • Sex
  • Fare
  • Parch: number of parents or children onboard
  • Embarked: port of embarkation (Cherbourg, Queenstown or Southampton)
  • SibSp: number of siblings or spouses onboard

2. Exploring the data

2.1 Categorical variables

From the Titanic dataset, we can explore the categorical variables and analyse their relation to the survival rate for each category.

In the code below, we build a bar plot of the number of passengers in each category of the Sex, Pclass, Embarked, SibSp and Parch variables, together with the percentage of survivors in each category.
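
The original code is embedded on Medium; here is a minimal base-R sketch of the same idea, assuming Kaggle's train.csv has been downloaded to the working directory:

```r
train <- read.csv("train.csv")  # Kaggle's training file; the path is an assumption

# Passenger counts per category, plus the share of survivors in each
for (v in c("Sex", "Pclass", "Embarked", "SibSp", "Parch")) {
  barplot(table(train[[v]]), main = v, ylab = "Passengers")
  print(round(tapply(train$Survived, train[[v]], mean), 2))  # survival rate
}
```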

Visually, it is possible to observe that female and first-class passengers have a higher survival rate.

2.2. Continuous variables

We can do a similar analysis of the continuous variables by creating bins and analysing the survival rate in each of them. In the code below, we analyse the age of the passengers by splitting it into 20 quantile bins, and the fare paid by splitting it into 10 bins (deciles).
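
As a hedged sketch of what such binning code could look like (bin_survival is a hypothetical helper, not the article's code, and train is the data frame loaded in the previous snippet):

```r
# Survival rate per quantile-based bin of a continuous variable
bin_survival <- function(x, survived, n_bins) {
  brks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1),
                          na.rm = TRUE))
  bins <- cut(x, breaks = brks, include.lowest = TRUE)
  round(tapply(survived, bins, mean), 2)
}

bin_survival(train$Age,  train$Survived, 20)  # 20 age bins
bin_survival(train$Fare, train$Survived, 10)  # 10 fare bins (deciles)
```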

Overall, the first bin, which contains the passengers aged from 0 to 6 years old, had a higher survival rate, as did the passengers who paid the highest fares (>78 pounds).

3. Building the tree

It is very simple to build a decision tree in R: we can use the rpart function and declare the target variable and its explanatory variables.

The rpart function creates a binary tree in which it recursively searches for the most significant segmentation to classify the survival rate.
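
A minimal sketch of such a call, assuming the training data is in the train data frame from above and using rpart.plot to draw the diagram (the exact formula used in the article may differ):

```r
library(rpart)
library(rpart.plot)

# Target variable on the left of ~, explanatory variables on the right
tree <- rpart(Survived ~ Pclass + Sex + Age + Fare + Parch + SibSp + Embarked,
              data = train, method = "class")

rpart.plot(tree)  # draws the tree diagram
```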

At the first level, you can see that 38% of the total population on board of the Titanic survived. On the second level, the tree breaks the population by sex because it has a highly significant relation to the survival rate: while women accounted for 35% of passengers and had a 74% chance of surviving, men only had a 19% chance. The other levels then keep breaking the tree into binary categories using the most significant explanatory variables.

Some other insights on the diagram above:

  • The group that had the best survival rate (95%) was women in first and second class;
  • Boys younger than 6.5 years old had a 67% chance of surviving;
  • Men older than 6.5 years old represented 62% of the passengers and had only a 17% chance of surviving.

4. Applying the model

In order to apply the tree model to a set of test data, we use the predict function. Then we classify the surviving passengers based on the predicted probability: if it is higher than 50%, the passenger is classified as a survivor.
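
A sketch of this step, assuming the fitted tree from above and a test data frame prepared with the same columns as the training data:

```r
# Probability of class "1" (survived) for each passenger in the test set
probs <- predict(tree, newdata = test, type = "prob")[, "1"]

# Classify as a survivor when the probability exceeds 50%
pred <- ifelse(probs > 0.5, 1, 0)
```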

5. Evaluating the model

5.1. Accuracy

To calculate the accuracy, we applied the model to a dataset that contains the real outcome of each passenger. Then we compared the model predictions by creating a confusion matrix, where the columns represent the passengers' real outcome and the rows the model prediction. In this case, we got 498 true negatives and 244 true positives.

The accuracy is calculated by adding the true positives and true negatives and dividing by the total number of observations:

Accuracy = (TN + TP) / Total Observations
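
In R, both the confusion matrix and the accuracy take only a couple of lines; this sketch assumes the evaluation set carries the true Survived labels:

```r
# Confusion matrix: columns hold the real outcome, rows the prediction
cm <- table(Predicted = pred, Actual = test$Survived)
cm

accuracy <- sum(diag(cm)) / sum(cm)  # (TN + TP) / total observations
```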

5.2. Sensitivity x Specificity

These indexes are used to diagnose the decision tree model. For some purposes, it might be important to quantify how well the model identifies specifically the positive or the negative cases.

Where TP is a true positive, FN is a false negative, TN is a true negative and FP is a false positive:

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)
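
Both follow directly from the confusion matrix built above; a short sketch:

```r
# Extract the four cells of the 2x2 confusion matrix
TN <- cm["0", "0"]; FP <- cm["1", "0"]
FN <- cm["0", "1"]; TP <- cm["1", "1"]

sensitivity <- TP / (TP + FN)  # true positive rate
specificity <- TN / (TN + FP)  # true negative rate
```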

5.3. ROC Curve

A ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds.

ROC curve (source: Wikipedia)

In the graph above, the true positive rate is the same as the sensitivity, and the false positive rate is equal to 1 − specificity. The closer the area under the ROC curve is to 1 (versus the 0.5 of a random classifier), the better the model.

In this case, the model got an area under the curve equal to 0.89.
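
One way to obtain the curve and its area in R is the pROC package (the article does not say which package was used, so this is an assumption), reusing the probabilities predicted earlier:

```r
library(pROC)

# Build the ROC curve from the true labels and the predicted probabilities
roc_obj <- roc(response = test$Survived, predictor = probs)

plot(roc_obj)  # draws the ROC curve
auc(roc_obj)   # area under the curve
```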

Credits:

This content is based on my class notes from USP ESALQ MBA Data Science & Analytics class.

You can check my Kaggle notebook for the Titanic Challenge here:

https://www.kaggle.com/marisakamoto/titanic-decisiontree
