Decision tree is a non-parametric supervised machine learning algorithm which makes use of a tree like structure to make decisions. Decision tree algorithms can be used for both classification and regression though it is mostly meant for classification tasks.

Decision trees consist of three types of nodes such as root node, leaf nodes and decision nodes. The first node in the graphical representation (Fig.1) is called root node. Decisions are made at the decision node and the terminating node of the tree-like structure is called leaf node. Leaf nodes are in effect the output of decisions made at decision nodes and so there won’t be any further extension from these leaf nodes.

Figure 1. Sample decision tree

Decision tree algorithm works with a nested if-else condition. The graphical representation is similar to that of a flow chart.

## Demonstration of decision tree with example

Day | Weather forecast | Temperature | Humidity | Play cricket |

1 | Sunny | Low | High | Yes |

2 | Clear sky | Low | Low | Yes |

3 | Rainy | Low | High | Yes |

4 | Sunny | High | High | No |

5 | Rainy | Low | Low | No |

6 | Clear sky | Low | High | Yes |

*Figure 2. Example decision tree*

As there is no uncertainty to make a decision when the weather forecast is a clear sky, we need not split the tree further. We have to get familiar with terms such as entropy, information gain and Gini index to interpret the decision tree when there is uncertainty or disorderliness associated with the training dataset. The following sections discuss the aforementioned concepts.

## Entropy

Entropy is a measure of randomness or disorder in the data. Suppose if you want to go for an outing with a group of 11 people together and you have three choices of places such as Ooty, Wayanad and Chikmagalur. Out of 11 among you, consider a situation in which 4 people want to go to Ooty, 5 prefer to go Wayanad and 2 give their preferences as Chikkamagaluru.

Now, it will be difficult to decide between Ooty and Wayanad as the number of people who prefer to go to these places are close in number. We can make a decision of going to Ooty based on the majority vote. But there is a disorderliness associated with this decision.

Entropy measures the disorderliness using the mathematical equation,

Entropy, $E= -\Sigma P_i log(P_i)$

$P_i$ is the probability for each choice.

The entropy associated with the choice of Ooty and Wayanad using the above equation is,

$E=-\frac{4}{11} log (\frac{4}{11})-\frac{5}{11} log (\frac{5}{11})$

$= -0.3636 \hspace{1mm} log (0.3636)- 0.4545 \hspace{1mm} log (0.4545)=0.1597+0.1556=0 .3153$

This means that there is an impurity with a value of 0.3153 associated with the node where the split to 4 and 5 happens.

On the other hand let’s examine the choice with Wayanad and Chikamagallur,

$E=-\frac{5}{11} \hspace{1mm} log (\frac{5}{11}) -\frac{2}{11} \hspace{1mm} log(\frac{2}{11})$

$=-0.4545 \hspace{1mm} log (0.4545)-0.18 \hspace{1mm} log(0.18)=0.1556+0.1346= 0.2902$

This entropy value of 0.2902 which is less than 0.3153 indicates that the impurity at the node where the split between 5 and 2 happens is less and it is easier to make a decision between the choice among 2 and 5 as compared to 4 and 5.

Though we got an idea about the impurity at the decision node in the two situations discussed, still there is less clarity about whether entropy at a succeeded node decreased or not as compared to the previous nodes or root node.

The entropy at a node just before the split into two decisions such as yes or no happens is called parent entropy or entropy at the parent node. The nodes which follow the parent node are called child nodes.

The metric called *information gain* can quantify the amount of entropy change when proceeding from a parent node to child node.

## Information Gain (IG)

Information gain is the difference between the entropy at the parent node and weighted average of the entry at the corresponding child nodes.,

i.e. IG = Entropy (parent node) – weighted average entropy of child nodes

This gives information about which feature is helpful to reduce the entropy. We can compare the IG of each feature and select the feature with more information gain as the root node and the split will follow from that node to make a decision.

The algorithm runs recursively over the subset of data after making the split of the data based on the significance of each feature at each node. This helps to reduce the number of iterations to arrive at a decision.

In order to overcome the computational complexity associated with log in entropy calculation we can use another parameter called Gini impurity.

## Gini Impurity

The mathematical formulation of Gini impurity is,

$I_G= 1 – \Sigma {P_i}^2$

Gini impurity and entropy have similar physical interpretations. However Gini impurity is preferred over entropy due to its computational efficiency.

## Overfitting and underfitting

If the depth of the decision tree is more, there will be less number of data points at a node. This increases the chance of overfitting and the model becomes less interpretable.

In order to avoid the chances of overfitting, it is possible to cut down the nodes or sub nodes which are insignificant. This process of removing the branches with less or no significance is called pruning.

On the contrary, if the depth of the decision tree is less, the chances of underfitting increases. A decision tree with depth equal to one is treated as a dump model and called a decision stump.

Depth of a decision tree can be determined by cross-validation.

## Regression using decision tree algorithm

In place of information gain used in classification, regression uses mean square error and mean absolute square error. Whichever feature gives the lowest weighted error is selected as the most useful feature for performing regression.

### Advantages of decision tree based models

- It is easy to visualise and interpret and so it is called as a white box model
- Cost of prediction is logarithmic in the number of training data
- Less data preparation is required
- Performs well even if the assumptions drawn at training stage are not correct
- It can handle both categorical and numerical data
- It is possible to validate decision tree models using statistical tests. This ensures reliability of the model

### Disadvantages of decision tree based models

- It is prone to overfitting
- Small variation in the data can make the decision tree unstable
- Not an effective choice to predict the outcome of a continuous variable