Machine Learning
Topics:
Decision Tree
• Statistics
• Data Manipulation & Visualisa...
• Machine Learning Algorithms
• Machine Learning Projects
• Challenges Winning Approach
• Transfer Learning

# Decision Tree

• Tutorial

### Overview

Decision Tree Analysis is a general, predictive modelling tool that has applications spanning a number of different areas. In general, decision trees are constructed via an algorithmic approach that identifies ways to split a data set based on different conditions. It is one of the most widely used and practical methods for supervised learning. Decision Trees are a non-parametric supervised learning method used for both classification and regression tasks. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

The decision rules are generally in form of if-then-else statements. The deeper the tree, the more complex the rules and fitter the model.

Before we dive deep, let's get familiar with some of the terminologies:

• Instances: Refer to the vector of features or attributes that define the input space
• Attribute: A quantity describing an instance
• Concept: The function that maps input to output
• Target Concept: The function that we are trying to find, i.e., the actual answer
• Hypothesis Class: Set of all the possible functions
• Sample: A set of inputs paired with a label, which is the correct output (also known as the Training Set)
• Candidate Concept: A concept which we think is the target concept
• Testing Set: Similar to the training set and is used to test the candidate concept and determine its performance

### Introduction

A decision tree is a tree-like graph with nodes representing the place where we pick an attribute and ask a question; edges represent the answers the to the question; and the leaves represent the actual output or class label. They are used in non-linear decision making with simple linear decision surface.

Decision trees classify the examples by sorting them down the tree from the root to some leaf node, with the leaf node providing the classification to the example. Each node in the tree acts as a test case for some attribute, and each edge descending from that node corresponds to one of the possible answers to the test case. This process is recursive in nature and is repeated for every subtree rooted at the new nodes.

Let's illustrate this with help of an example. Let's assume we want to play badminton on a particular day — say Saturday — how will you decide whether to play or not. Let's say you go out and check if it's hot or cold, check the speed of the wind and humidity, how the weather is, i.e. is it sunny, cloudy, or rainy. You take all these factors into account to decide if you want to play or not.

So, you calculate all these factors for the last ten days and form a lookup table like the one below.

Day Weather Temperature Humidity Wind Play?
1 Sunny Hot High Weak No
2 Cloudy Hot High Weak Yes
3 Sunny Mild Normal Strong Yes
4 Cloudy Mild High Strong Yes
5 Rainy Mild High Strong No
6 Rainy Cool Normal Strong No
7 Rainy Mild High Weak Yes
8 Sunny Hot High Strong No
9 Cloudy Hot Normal Weak Yes
10 Rainy Mild High Strong No

Table 1. Obeservations of the last ten days.

Now, you may use this table to decide whether to play or not. But, what if the weather pattern on Saturday does not match with any of rows in the table? This may be a problem. A decision tree would be a great way to represent data like this because it takes into account all the possible paths that can lead to the final decision by following a tree-like structure.

Fig 1. A decision tree for the concept Play Badminton

Fig 1. illustrates a learned decision tree. We can see that each node represents an attribute or feature and the branch from each node represents the outcome of that node. Finally, its the leaves of the tree where the final decision is made. If features are continuous, internal nodes can test the value of a feature against a threshold (see Fig. 2).

Fig 2. A decision tree for the concept Play Badminton (when attributes are continuous)

A general algorithm for a decision tree can be described as follows:

1. Pick the best attribute/feature. The best attribute is one which best splits or separates the data.
4. Go to step 1 until you arrive to the answer.

The best split is one which separates two different labels into two sets.

### Expressiveness of decision trees

Decision trees can represent any boolean function of the input attributes. Let’s use decision trees to perform the function of three boolean gates AND, OR and XOR.

Boolean Function: AND

Fig 3. Decision tree for an AND operation.

In Fig 3., we can see that there are two candidate concepts for producing the decision tree that performs the AND operation. Similarly, we can also produce a decision tree that performs the boolean OR operation.

Boolean Function: OR

Fig 4. Decision tree for an OR operation

Boolean Function: XOR

Fig 5. Decision tree for an XOR operation.

Let’s produce a decision tree performing XOR functionality using 3 attributes:

Fig 6. Decision tree for an XOR operation involving three operands

In the decision tree, shown above (Fig 6.), for three attributes there are 7 nodes in the tree, i.e., for $n = 3$, number of nodes = $2^3-1$. Similarly, if we have $n$ attributes, there are $2^n$ nodes (approx.) in the decision tree. So, the tree requires exponential number of nodes in the worst case.

We can represent boolean operations using decision trees. But, what other kind of functions can we represent and if we search over the various possible decision trees to find the right one, how many decision trees do we have to worry about. Let’s answer this question by finding out the possible number of decision trees we can generate given N different attributes (assuming the attributes are boolean). Since a truth table can be transformed into a decision tree, we will form a truth table of N attributes as input.

X1 X2 X3 .... XN OUTPUT
T T T ... T
T T T ... F
... ... ... ... ...
... ... ... ... ...
... ... ... ... ...
F F F ... F

The above truth table has $2^n$ rows (i.e. the number of nodes in the decision tree), which represents the possible combinations of the input attributes, and since each node can a hold a binary value, the number of ways to fill the values in the decision tree is ${2^{2^n}}$. Thus, the space of decision trees, i.e, the hypothesis space of the decision tree is very expressive because there are a lot of different functions it can represent. But, it also means one needs to have a clever way to search the best tree among them.

### Decision tree boundary

Decision trees divide the feature space into axis-parallel rectangles or hyperplanes. Let’s demonstrate this with help of an example. Let’s consider a simple AND operation on two variables (see Fig 3.). Assume X and Y to be the coordinates on the x and y axes, respectively, and plot the possible values of X and Y (as seen the table below). Fig 7. represents the formation of the decision boundary as each decision is taken. We can see that as each decision is made, the feature space gets divided into smaller rectangles and more data points get correctly classified.

Fig 7. Animation showing the formation of the decision tree boundary for AND operation

### The decision tree learning algorithm

The basic algorithm used in decision trees is known as the ID3 (by Quinlan) algorithm. The ID3 algorithm builds decision trees using a top-down, greedy approach. Briefly, the steps to the algorithm are: - Select the best attribute → A - Assign A as the decision attribute (test case) for the NODE. - For each value of A, create a new descendant of the NODE. - Sort the training examples to the appropriate descendant node leaf. - If examples are perfectly classified, then STOP else iterate over the new leaf nodes.

Now, the next big question is how to choose the best attribute. For ID3, we think of the best attribute in terms of which attribute has the most information gain, a measure that expresses how well an attribute splits that data into groups based on classification.

Pseudocode: ID3 is a greedy algorithm that grows the tree top-down, at each node selecting the attribute that best classifies the local training examples. This process continues until the tree perfectly classifies the training examples or until all attributes have been used.

The pseudocode assumes that the attributes are discrete and that the classification is binary. Examples are the training example. Target_attribute is the attribute whose value is to be predicted by the tree. Attributes is a list of other attributes that may be tested by the learned decision tree. Finally, it returns a decision tree that correctly classifies the given Examples.

ID3(Examples, Target_attribute, Attributes): - Create a root node for the tree. - If all Examples are positive, return the single-node tree root, with positive labels. - If all Examples are negative, return the single-node tree root, with negative labels. - If Attributes is empty, return the single-node tree root, with the most common labels of the Target_attribute in Examples. - Otherwise, begin - A ← the attribute from Attributes that best* classifies Examples - The decision attribute for root ← A - For each possible value $v_i$, of A, - Add a new tree branch below root, corresponding to the test A = $v_i$ - Let Examples_vi be the subset of Examples that have value $v_i$ for A. - If Examples_vi is empty - Then, below this new branch add a leaf node with the labels having the most common value of Target_attribute in Examples. - Else, below this new branch add the subtree(or call the function) - ID3(Examples_vi, Target_attribute, Attributes-{A}) - End - Return root

* Adopted from Machine Learning by Tom M. Mitchell*

*The best attribute is the one with the highest information gain.

### Calculating information gain

As stated earlier, information gain is a statistical property that measures how well a given attribute separates the training examples according to their target classification. In the figure below, we can see that an attribute with low information gain (right) splits the data relatively evenly and as a result doesn’t bring us any closer to a decision. Whereas, an attribute with high information gain (left) splits the data into groups with an uneven number of positives and negatives and as a result helps in separating the two from each other.

Fig 8. Distribution of points in case of high and low information gain.

To define information gain precisely, we need to define a measure commonly used in information theory called entropy that measures the level of impurity in a group of examples. Mathematically, it is defined as:

$Entropy : \sum_{i=1}-p*log_2(p_i)$
$p_i = Probability of class i$

Since, the basic version of the ID3 algorithm deal with the case where classification are either positive or negative, we can define entropy as :

$Entropy(S) = -p_+log_2p_+ - p_-log_2p_-$

where,

S is a sample of training examples

$p_+$ is the proportion of positive examples in S

$p_-$ is the proportion of negative examples in S

To illustrate, suppose S is a sample containing 14 boolean examples, with 9 positive and 5 negative examples. Then, the entropy of S relative to this boolean classification is:

Entropy ([9+, 5-]) = -(9/14)$\cdot log_2$(9/14) - (5/14)$\cdot log_2$(5/14) = 0.940

Note that entropy is 0 if all the members of S belong to the same class. For example, if all members are positive ($p_+$=1), then $p_-$ is 0, and Entropy(S) = -1$\cdot log_2$(1) -0$\cdot log_2$(0) = 0. Entropy is 1 when the sample contains an equal number of positive and negative examples. If the sample contains unequal number of positive and negative examples, entropy is between 0 and 1. The following figure shows the form of the entropy function relative to a boolean classification as $p_+$ varies between 0 and 1.

Fig 9. Entropy function to a boolean classification, as the proportion $p_+$, of positive examples varies between 0 & 1.

Now, given entropy as a measure of the impurity in a sample of training examples, we can now define information gain as a measure of the effectiveness of an attribute in classifying the training data. Information gain, Gain (S, A) of an attribute A, relative to a sample of examples S, is defined as:

$Gain(S, A) \equiv Entrpoy(S) - \sum_{v \in Values(A)} rac{|S_v|}{|S|} \cdot Entropy(S_v)$

where Values(A) is the set of all possible values for attribute the A, and $S_v$ is the subset of S for which attribute A has value v. Note the first term in the equation is just entropy of the original sample S, and the second term is the expected value of entropy after S is partitioned using attribute A, i.e. entropy of its children. Expected entropy described by this second term is simply the sum of entropies of each subset $S_v$, weighted by the fraction of examples $rac{|S_v|}{|S|}$that belong to $S_v$. Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of attribute A.

In short :

$Information Gain = Entropy(parent node) - [Average Entropy(children)]$

For example, suppose a sample (S) has 30 instances (14 positive and 16 negative labels) and an attribute A divides the samples into two subsamples of 17 instances (4 negative and 13 positive labels) and 13 instances (1 positive and 12 negative labels) (see Fig. 9).

Fig 10. Example of decision tree sorting instances based on information gain.

Let’s calculate the information gain of the attribute A. We know that:

$Gain(S, A) \equiv Entrpoy(S) - \sum_{v \in Values(A)} rac{|S_v|}{|S|} \cdot Entropy(S_v)$

and,

$Entropy(S) = -p_+log_2p_+ - p_-log_2p_-$

Entropy of parent = Entropy(S) = -$rac{14}{30}\cdot log_2 rac{14}{30}$ - $rac{16}{30}\cdot log_2 rac{16}{30}$ = 0.996

Entropy of child with 17 instances = Entropy($S_1$) = -$rac{13}{17}\cdot log_2 rac{13}{17}$ - $rac{4}{17}\cdot log_2 rac{4}{17}$ = 0.787

Entropy of child with 13 instances = Entropy($S_2$) = -$rac{1}{13}\cdot log_2 rac{1}{13}$ - $rac{12}{13}\cdot log_2 rac{12}{13}$ = 0.391

(Weighted) Average Entropy of children = $rac{17}{30} \cdot 0.787$ + $rac{13}{30} \cdot 0.391$ = 0.615

Information Gain = G(S, A) = 0.996 - 0.615 = 0.38

Similarly, we can calculate the information gain for each attribute (from the set of attributes) and select the attribute with highest information gain as the best attribute to split upon.

### Coding a decision tree

We will use the scikit-learn library to build the decision tree model. We will be using the iris dataset to build a decision tree classifier. The data set contains information of 3 classes of the iris plant with the following attributes: - sepal length - sepal width - petal length - petal width - class: Iris Setosa, Iris Versicolour, Iris Virginica

The task is to predict the class of the iris plant based on the attributes. Link to data.

#Importing required libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split


The scikit-learn dataset library already has the iris dataset. You can either use the dataset from the source or import it from the scikit-learn dataset library.

#Loading the iris data
print('Classes to predict: ', data.target_names)


There are three classes of iris plants: 'setosa', 'versicolor' and 'virginica'. Now, we have imported the iris data in the variable 'data'. We will now extract the attribute data and the corresponding labels. We can extract the attributes and labels by calling .data and .target as shown below:

#Extracting data attributes
X = data.data
### Extracting target/ class labels
y = data.target

print('Number of examples in the data:', X.shape[0])


There are 150 examples/ samples in the data. The variable 'X' contains the attributes to the iris plant. The cell below shows the 4 attributes of the first four iris plants.

#First four rows in the variable 'X'
X[:4]

#Output
Out: array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2]])


Now that we have extracted the data attributes and corresponding labels, we will split them to form train and test datasets. For this purpose, we will use the scikit-learn's 'train_test_split' function, which takes in the attributes and labels as inputs and produces the train and test sets.

#Using the train_test_split to create train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 47, test_size = 0.25)


Since, this is a classification problem, we will import the DecisionTreeClassifier function from the sklearn library. Next, we will set the 'criterion' to 'entropy', which sets the measure for splitting the attribute to information gain.

#Importing the Decision tree classifier from the sklearn library.
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion = 'entropy')


Next, we will fit the classifier on the train attributes and labels.

#Training the decision tree classifier.
clf.fit(X_train, y_train)

#Output:
Out:DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')


Now, we will use the trained classifier/ model to predict the labels of the test attributes.

#Predicting labels on the test set.
y_pred =  clf.predict(X_test)


We will now evaluate the predicted classes using some metrics. For this case, we will use 'accuracy_score' to calculate the accuracy of the predicted labels.

#Importing the accuracy metric from sklearn.metrics library

from sklearn.metrics import accuracy_score
print('Accuracy Score on train data: ', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))
print('Accuracy Score on test data: ', accuracy_score(y_true=y_test, y_pred=y_pred))

#Output:
Out: Accuracy Score on train data:  1.0
Accuracy Score on test data:  0.9473684210526315


Next, we will tune the parameters of the decision tree to increase its accuracy. One of those parameters is 'min_samples_split', which is the minimum number of samples required to split an internal node. Its default value is equal to 2 because we cannot split on a node containing only one example/ sample.

clf = DecisionTreeClassifier(criterion='entropy', min_samples_split=50)
clf.fit(X_train, y_train)
print('Accuracy Score on train data: ', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))
print('Accuracy Score on the test data: ', accuracy_score(y_true=y_test, y_pred=clf.predict(X_test)))

#Output:
Out: Accuracy Score on train data:  0.9553571428571429
Accuracy Score on test data:  0.9736842105263158


We can see that the accuracy on the test set increased, while it decreased on the training set. This is because increasing the value of the min_sample_split smoothens the decision boundary and thus prevents it from overfitting. You may tune other parameters of the decision tree and check how they affect the decision boundary in a similar way. You can check the other parameters here.

### Issues in decision trees

#### Avoiding overfitting

Since the ID3 algorithm continues splitting on attributes until either it classifies all the data points or there are no more attributes to splits on. As a result, it is prone to creating decision trees that overfit by performing really well on the training data at the expense of accuracy with respect to the entire distribution of data.

There are, in general, two approaches to avoid this in decision trees: - Allow the tree to grow until it overfits and then prune it. - Prevent the tree from growing too deep by stopping it before it perfectly classifies the training data.

A decision tree’s growth is specified in terms of the number of layers, or depth, it’s allowed to have. The data available to train the decision tree is split into training and testing data and then trees of various sizes are created with the help of the training data and tested on the test data. Cross-validation can also be used as part of this approach. Pruning the tree, on the other hand, involves testing the original tree against pruned versions of it. Leaf nodes are removed from the tree as long as the pruned tree performs better on the test data than the larger tree.

#### Incorporating continuous valued attributes

Our initial definition of ID3 is restricted to attributes that take on a discrete set of values. One way to make the ID3 algorithm more useful with continuous variables is to turn them, in a way, into discrete variables. Let’s say in our example of Play Badminton the temperature is continuous (see the following table), we could test the information gain of certain partitions of the temperature values, such as temperature > 42.5. Typically, whenever the classification changes from no to yes or yes to no, the average of the two temperatures is taken as a potential partition boundary.

Day Weather Temperature Humidity Wind Play?
1 Sunny 80 High Weak No
2 Cloudy 66 High Weak Yes
3 Sunny 43 Normal Strong Yes
4 Cloudy 82 High Strong Yes
5 Rainy 65 High Strong No
6 Rainy 42 Normal Strong No
7 Rainy 70 High Weak Yes
8 Sunny 81 High Strong No
9 Cloudy 69 Normal Weak Yes
10 Rainy 67 High Strong No

Table 2. Obeservations of the last ten days.

Because 42 corresponds to No and 43 corresponds to Yes, 42.5 becomes a candidate. If any of the partitions end up exhibiting the greatest information gain, then it is used as an attribute and temperature is removed from the set of potential attributes to split on.

#### Alternative measures for selecting attributes

The information gain formula used by ID3 algorithm treats all of the variables the same regardless of their distribution and their importance. This is a problem when it comes to continuous variables or discrete variables with many possible values because training examples may be few and far between for each possible value, which leads to low entropy and high information gain by virtue of splitting the data into small subsets but results in a decision tree that might not generalize well.

One way to avoid this is to use some other measure to find the best attribute instead of information gain. An alternative measure to information gain is gain ratio (Quinlan 1986). Gain ratio tries to the correct the information gain’s bias towards attributes with many possible values by adding a denominator to information gain called split information. Split Information tries to measure how broadly and uniformly the attribute splits the data:

$SplitInformation(S, A) = - \sum_{i=1}^{c} rac{|S_i|}{|S|} \cdot log_2 rac{|S_i|}{|S|}$

The Gain Ratio is defined in terms of Gain and SplitInformation as,

$Gain Ratio(S, A) \equiv rac{Gain(S, A)}{SplitInformation(S, A)}$

One practical issue that arises in using gain ratio in place of information gain is that the denominator can be zero or very small when $|S_i|pprox|S|$ for one of the $S_i$. This either makes the Gain ratio undefined or very large for attributes that happen to have the same value for nearly all members of S.For example, if there’s just one possible value for the attribute, then the formula equals $log_2$1 = 0.Luckily, we tend not to include attributes with 1 possible value in our training data because it is impossible to carry out the ID3 algorithm by splitting on an attribute with only 1 value, so Gain ratio doesn’t have to handle the possibility of a denominator of 0. On the other hand, our continuous temperature example has 10 possible values in our training data, each of which occur once, which leads to -(1/10)$\cdot log_2$(1/10) = $log_2$10 . In general, the SplitInformation of an attribute with n equally distributed values is $log_2n$. These relatively large denominators significantly affect an attribute’s chances of being the best attribute after an iteration of the ID3 algorithm and help in avoiding choices that perform particularly well on the training data but not so well outside of it.