Decision Trees — Just the everyday things!

KDAG IIT KGP
16 min read · Sep 8, 2021

“Waiting hurts. Forgetting hurts. But not knowing which decision to take can sometimes be the most painful…”

In this article, we shall be covering the following:

  1. Introduction
  2. Components of a Decision Tree
  3. Classification Trees
  4. Regression Trees
  5. Exploring the Algorithm
  6. Entropy, Gini Index, Variance
  7. Building a Decision Tree Classifier and Regressor using the scikit-learn library
  8. Overfitting & Pruning
  9. Pre-pruning & Post-pruning
  10. Advantages & Disadvantages
  11. Conclusion

Let’s get started!

What am I?

Imagine a scenario where you are a property broker and have a list of house prices with given characteristics: number of bedrooms, area, electricity and water availability, etc. What if someone calls you up to make a deal on a new house that isn’t on your price list? Is there a way you can predict the price of this new house from your existing data?

Well, you’ll be glad that there are Decision Trees to the rescue. Here is an example of a very basic Decision Tree.

A Decision Tree is a tree-based supervised machine learning algorithm that makes a series of clear binary choices, using internal nodes to split the data and leaves to offer a conclusion based on those choices.

Components

Root: The node at the top of the tree (1️⃣), which kickstarts the chain of questions and decisions, is called the root.

Internal node: Each internal node (2️⃣, 3️⃣, 5️⃣) is a junction that represents a test on one of the attributes/features and is followed by branches that represent the outcomes of the test. The convention here is that the right branch corresponds to the test evaluating to true and the left branch to false. Each branch leads either to another node or to a leaf.

Leaf: A leaf (4️⃣, 6️⃣, 7️⃣, 8️⃣, 9️⃣) is the final step in any Decision Tree algorithm and represents the decision reached after evaluating all the features that concern it. Leaves are also referred to as terminal nodes.

Let’s delve deeper into the subject of Decision Trees which are broadly divided into Classification Trees, which are used for object classification, and Regression Trees, which are used for predicting continuous values.

Classification Trees

Suppose a company is about to launch a product and would like to have maximum reach within a fixed marketing budget, i.e., it needs to identify its target demographic. A Decision Tree is incredibly helpful in such situations because it mirrors how the human brain thinks: it picks one of the two options at hand and refines the choice further through recursive binary splits until it reaches a conclusion. Also, given the simplicity of the Decision Tree, the company could use its sales data from previous products, along with data on similar products in the market, to model the various parameters that could impact sales.

Let’s take an example to understand how it works. For the sake of simplicity, we would be classifying the person as simply fit or unfit (which might be a bit crude but bear with me).

We start off with the question at the root node: is the person younger than 30 years? If yes, the next attribute considered is the abundance of pizza in their diet. If that evaluates to true, they are unfit; otherwise, they are deemed fit.

Now let us look at the path we would have gone down if the person were older than 30 years. The next feature considered in that case is whether or not they exercise in the morning: if they do, they are deemed fit, otherwise unfit. See how simple that was? This intuitive nature of Decision Trees, and how easily they can be visualized, is what makes them so popular.
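To make this concrete, the same toy tree can be written as a couple of nested if/else checks. This is only a hand-coded sketch of the tree described above; the attribute names are made up for illustration.

```python
def is_fit(age, eats_lots_of_pizza, exercises_in_morning):
    """Hand-coded version of the toy fitness tree described above."""
    if age < 30:
        # Younger branch: diet decides the outcome
        return "unfit" if eats_lots_of_pizza else "fit"
    # Older branch: morning exercise decides the outcome
    return "fit" if exercises_in_morning else "unfit"

print(is_fit(25, eats_lots_of_pizza=True, exercises_in_morning=False))   # unfit
print(is_fit(45, eats_lots_of_pizza=False, exercises_in_morning=True))   # fit
```

A learned Decision Tree is essentially this kind of nested if/else structure, except that the questions and thresholds are chosen automatically from the data.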

For any classification tree model, the attributes and their relative positions are chosen so that we can make our prediction with the maximum possible probability. Questions of higher importance are placed closer to the root node, and various methods and parameters affect this ordering, which we will elaborate on later.

Regression Trees

You must have a pretty good idea by now of how classification trees work, and with a little more calculation you can learn the regression tree algorithm. The key difference between a classification tree and a regression tree is the target: the former predicts a boolean (or categorical) value, the latter a numeric one. In a classification tree, the model segregates the data based on the “yes or no” answers provided, which is easy to understand and to code. In a regression tree, however, we have a vast range of numeric values to consider and must still divide them into groups. So we bring in a limiting value, a threshold, that decides the flow of the split: if the feature value at a node is greater than the threshold, we go right in the tree, otherwise left (KDAG is right and nothing is left :-)).

To find the best splitting line, we need a function that quantifies how well each candidate line separates the data, which will help us pick the best one among them. This is done by measuring the error each line leaves behind and selecting the one with the least error. Which raises the question: “are we really going to check each and every value on the axis?” Pssst, definitely not.

Sit tight as we begin with the algorithm itself!

Exploring the Algorithm

First, sort the dataset with respect to a feature and take the average of each pair of consecutive samples; these midpoints are our candidate split points. Each candidate graphically splits the dataset into two parts, so we can check how well it has separated the data. To do that, we take the average of each of the two groups individually and measure how far the actual values deviate from it by summing the squares of the differences (technically called the “sum of squares of residuals”, abbreviated as SSR). We calculate this for every candidate split and select the one with the least total error. We then repeat the same procedure inside each of the two resulting groups and continue until the final nodes (leaves) are of sufficient purity. You could keep partitioning as much as you want, but that would result in overfitting, a common dilemma in machine learning (the bias-variance tradeoff; don’t worry, we’ll explain it in a later section), where the model works well on the training dataset but poorly on the test dataset.
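As a rough sketch of this search (not code from the article; the function name best_split_ssr is ours), finding the best threshold on a single numeric feature could look like this:

```python
import numpy as np

def best_split_ssr(x, y):
    """Return the threshold on feature x that minimizes the total SSR
    (sum of squared residuals) of the two resulting groups."""
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]

    # Candidate thresholds: averages of consecutive feature values
    candidates = (x[:-1] + x[1:]) / 2.0

    best_threshold, best_ssr = None, np.inf
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        # SSR of a group = sum of squared deviations from the group mean
        ssr = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if ssr < best_ssr:
            best_threshold, best_ssr = t, ssr
    return best_threshold, best_ssr
```

In a full regression tree, this search is run over every feature and then repeated recursively inside each child node.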

A suitable example will make this even clearer; the following small dataset will be used to do so.

From the above dataset, we get the candidate split points 166, 168, 170, 171.5 and 172.5. We will show how to calculate the SSR for just one of them; the same can be done for all of the candidates, from which we pick the split with the least error.

Let’s take the candidate value 168. Then

the average of the first group = (4 + 4)/2 = 4

and the average of the second group = (5 + 9 + 9 + 10)/4 = 8.25

SSR = (4 − 4)² + (4 − 4)² + (5 − 8.25)² + (9 − 8.25)² + (9 − 8.25)² + (10 − 8.25)² = 14.75

Computing all the SSRs and comparing the values, we find that 170 is the best split for this dataset. The same procedure extends to datasets with many more samples, and there you go: you have essentially learned how a regression tree works. A regression tree handles such data well but can easily overfit it, so one has to be careful; in the coming sections we will discuss how to choose good splits and how to keep the tree from growing unnecessarily large.
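Running the sketch above on this toy example confirms the result. The original table of feature values is not reproduced here, so the heights below are assumed purely because their consecutive midpoints match the candidates 166, 168, 170, 171.5 and 172.5 listed above:

```python
heights = [165, 167, 169, 171, 172, 173]   # assumed feature values (see note above)
targets = [4, 4, 5, 9, 9, 10]              # target values from the worked example

threshold, ssr = best_split_ssr(heights, targets)   # reuses the sketch above
print(threshold, round(ssr, 2))   # -> 170.0, the split with the lowest total SSR
```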

Entropy, Gini & Variance

With more than one attribute taking part in the decision-making process, it is necessary to decide the relevance and importance of each attribute, placing the most relevant one at the root node and working downwards by splitting the nodes. The major challenge, then, is identifying this attribute at each level. This process is known as attribute selection. The following is an example of how a group of students playing cricket can be split based on their gender, height, and class features.

We need to decide which feature to split the students on so that we obtain the most homogeneous groups. For doing that, Decision Trees rely on various criteria. Let us look at a few popular attribute selection measures:

Entropy

We all know that in thermodynamics, entropy is a measure of disorder. In the machine learning world, it can be defined as a measure of the impurity of a node. As the Decision Tree partitions the data into smaller subsets, the entropy of each candidate split is computed and decides which parameter/feature of the data will result in the best split.

The mathematical formula for the calculation of entropy is as follows:

Entropy = − Σᵢ P(i) · log₂ P(i)

where the probability P(i) denotes the fraction of examples belonging to a given class i.

Basically, entropy is the negative of the sum, over all classes, of the probability of each class multiplied by the logarithm of that probability. Since each probability lies between 0 and 1, its logarithm is negative (or zero), so the sum itself is never positive; multiplying by −1 makes the entropy non-negative.

Higher values of entropy imply a higher level of impurity. The maximum impurity for a dataset is reached when all the classes are equally represented.

Steps to split a Decision Tree using Entropy:

  • For each split, we individually calculate the entropy of each child node,
  • Then we calculate the entropy of each split as the weighted average entropy of child nodes
  • We select the split with the lowest value of Entropy to be used as the splitter for the dataset
  • We keep on repeating steps 1–3 until we get homogeneous nodes (or nodes of sufficient purity)
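As a small illustration of these steps (a sketch, not code from the article; the function names are ours), the entropy of a node and the weighted entropy of a candidate split can be computed like this:

```python
import numpy as np

def entropy(labels):
    """Entropy of a node: -sum(p * log2(p)) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def split_entropy(left_labels, right_labels):
    """Entropy of a split: weighted average entropy of the two child nodes."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    return (n_left / n) * entropy(left_labels) + (n_right / n) * entropy(right_labels)

# A perfectly separating split has entropy 0; a completely mixed one has entropy 1
print(split_entropy(["fit", "fit"], ["unfit", "unfit"]))   # 0.0
print(split_entropy(["fit", "unfit"], ["fit", "unfit"]))   # 1.0
```

Among all candidate splits, the one with the lowest weighted entropy (equivalently, the highest information gain) is chosen.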

Below is a Decision Tree performing classification on Iris Dataset using the ‘entropy’ criterion.

Gini Index

Gini index or Gini impurity measures the probability that a randomly chosen element would be wrongly classified if it were labelled at random according to the class distribution in the node. The value of the Gini index varies from 0 to 1.

  • 0 indicates that all elements belong to a single class (a pure node)
  • values approaching 1 indicate that elements are spread randomly across many classes
  • 0.5 indicates that elements are equally distributed between two classes

The mathematical formula for calculating the Gini index is as follows:

Gini = 1 − Σᵢ (Pᵢ)²

Here Pᵢ is the probability of an object being classified to a particular class i. For example, a node holding 8 samples of one class and 2 of another has a Gini index of 1 − (0.8² + 0.2²) = 0.32. While building a Decision Tree, we prefer the attribute/feature with the lowest Gini index as the root node/parent node.

The steps to split a Decision Tree using Gini Impurity:

  • For each split, we individually calculate the Gini Impurity of each child node
  • Then we calculate the Gini Impurity of each split as the weighted average Gini Impurity of child nodes
  • We select the split with the lowest value of Gini Impurity to be used as the splitter for the data set
  • We keep on repeating steps 1–3 until we get homogeneous nodes.

Below is a Decision Tree performing classification on Iris Dataset using the ‘gini’ criterion.

Comparison between Gini Index and Entropy

The internal working of both methods is very similar, and both are used to evaluate the candidate feature/split at every node. If we compare the two, Gini impurity is more efficient than entropy in terms of computing power. Let’s plot entropy and Gini impurity against the probability of one class:

As one can see in the graph, entropy rises to a maximum of 1 (at a 50-50 class split) and then decreases, while Gini impurity peaks at only 0.5 before decreasing. So entropy ranges from 0 to 1, and Gini impurity from 0 to 0.5 for two classes. The real computational difference comes from the formulas: entropy involves logarithms, which are more expensive to evaluate, so calculating the Gini index is faster.
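The graph itself is not reproduced here; a small sketch that recreates it for a two-class node (assuming matplotlib, which the original figure may or may not have used) is:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)                      # probability of one class
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # peaks at 1.0 when p = 0.5
gini = 1 - (p ** 2 + (1 - p) ** 2)                      # peaks at 0.5 when p = 0.5

plt.plot(p, entropy, label="Entropy")
plt.plot(p, gini, label="Gini impurity")
plt.xlabel("Probability of one class")
plt.ylabel("Impurity")
plt.legend()
plt.show()
```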

Hence Gini impurity is often preferred over entropy simply because it is cheaper to compute; in practice, the two criteria usually select very similar splits.

Variance

Reduction in Variance is a method for splitting the node used when the target variable is continuous, i.e., in case of regression problems. It is so-called because it uses variance as a measure for deciding the feature on which the node is split into child nodes. The mathematical formula for variance is:

Variance = Σ (X − 𝛍)² / N

Here, X is an actual value, 𝛍 is the mean of all values, and N is the total number of values.

Variance is used for calculating the homogeneity of a node. If a node is entirely homogeneous, then the variance is zero. The value of the variance increases as the node becomes less homogeneous, i.e., more impure.

The steps to split a Decision Tree using variance:

  • For each split, we individually calculate the variance of each child node
  • Then we calculate the variance of each split as the weighted average variance of child nodes
  • We select the split with the lowest value of variance to be used as the splitter for the data set
  • We keep on repeating steps 1–3 until we get homogeneous nodes.
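A minimal sketch of this criterion (the function name is ours, not from the article):

```python
import numpy as np

def weighted_variance(left_values, right_values):
    """Variance of a split: weighted average of the child-node variances."""
    left = np.asarray(left_values, float)
    right = np.asarray(right_values, float)
    n = len(left) + len(right)
    return (len(left) / n) * left.var() + (len(right) / n) * right.var()

# The split whose children are most homogeneous has the lowest weighted variance
print(weighted_variance([4, 4, 5], [9, 9, 10]))   # low: each child is fairly uniform
print(weighted_variance([4, 9, 10], [4, 5, 9]))   # high: both children are mixed
```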

Now let’s get our hands dirty with some coding!

Let’s Code a bit!

1. Building a Decision Tree Classifier using scikit-learn

(the link to the dataset on which we are working can be found here)
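The embedded notebook and its dataset link are not reproduced here, so the snippet below is a minimal sketch of the same workflow using scikit-learn’s built-in Iris dataset as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset, split into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a classification tree; criterion can be "gini" (default) or "entropy"
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

print("Train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```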

2. Building a Decision Tree Regressor using scikit-learn

(the link to the dataset on which we are working can be found here)
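Again, the original snippet and dataset are not shown here; a minimal sketch with scikit-learn’s built-in Diabetes dataset as a stand-in:

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression dataset
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a regression tree; splits are chosen to minimize the squared error
reg = DecisionTreeRegressor(max_depth=4, random_state=42)
reg.fit(X_train, y_train)

print("Train MSE:", mean_squared_error(y_train, reg.predict(X_train)))
print("Test MSE:", mean_squared_error(y_test, reg.predict(X_test)))
```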

Overfitting and Pruning

By now you must’ve gotten a decent idea about how Decision Trees work, so let us look at how to improve their performance. If a Decision Tree grows without any restriction, it will end up giving a very high accuracy score on the training dataset, which might sound like a good thing; after all, we want our model to learn from the data. However, in doing so, quite a few nodes may end up splitting with very little data at their disposal, just to reduce the loss and increase the information gain. This can result in wrong and unreliable predictions on future datasets. This situation is called “overfitting”.

In overfitting, the machine learning model predicts the outcomes of the training dataset almost perfectly, but it may fail to do so on any other dataset and end up with erroneous predictions. We say that such a model has low bias and high variance. Here, bias reflects how strong the assumptions made about the model are, while variance measures how strongly the model depends on the particular training dataset.

Underfitting and overfitting are two opposing situations as shown here:

For an ideal fit, we would like to keep both the variance and the bias low. In overfitting, the model essentially memorizes the outcomes of the training set: we make hardly any assumptions about the data (low bias), but while fitting it we also fit the noise, so the model relies heavily on the training data (high variance). Similarly, in underfitting, the model fits the data points using too few features or too simple a hypothesis, which leads to a high-bias, low-variance situation.

As can be seen from the graph, there exists a sweet spot between the underfitting and overfitting regions that gives us the optimal complexity.

An example of overfitting and underfitting

Striking the right balance

There are several methods to prevent the model from overfitting the training data and we’ll look at one of the widely used methods known as “Pruning”. As the name suggests, we will prevent the tree from becoming unnecessarily large.

The two kinds of pruning are: (a) pre-pruning and (b) post-pruning.

Pre-pruning

In pre-pruning, we impose restrictions on how the tree may grow and on when nodes may split before allowing the model to fit the data. For pre-pruning we can use several parameters that are available with scikit-learn’s DecisionTreeClassifier and DecisionTreeRegressor models. Some of these parameters (a.k.a. hyperparameters) are:

  • max_depth: Restricts the maximum depth of the tree.
  • min_samples_split: Sets a threshold on the minimum number of samples a node must have before it can be split.
  • max_leaf_nodes: Restricts the maximum number of leaves the tree can have.
  • min_impurity_decrease: Allows a node to split only if the resulting decrease in impurity is greater than the assigned threshold.

There are a few other parameters that can be tweaked to prevent overfitting. Tweaking these parameters is also called “hyperparameter tuning”. The optimal values of the above parameters can be found with a search strategy such as Grid Search, and the best value for each parameter is chosen by evaluating candidates on held-out data or, better, by cross-validation. An initial guess for some of the parameters can often be obtained directly from the original overfit tree.
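As a sketch of what such a search might look like with scikit-learn’s GridSearchCV (the parameter grid below is illustrative, not taken from the article):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative grid over a few pre-pruning hyperparameters
param_grid = {
    "max_depth": [2, 3, 4, None],
    "min_samples_split": [2, 5, 10],
    "max_leaf_nodes": [None, 5, 10, 20],
}

# 5-fold cross-validated search for the best combination
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```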

Post-pruning

Apart from pre-pruning, we can also go for post-pruning, which often gives comparable or better results. In post-pruning we first allow the tree to grow without any restrictions, resulting in overfitting, and once the Decision Tree model is obtained we start pruning it back. One of the most common methods for this is “Cost Complexity Pruning”. Let’s understand it in some more detail.

Let us define a function R(T), which calculates the sum of squared errors between the predicted and actual values for a tree T over the given data points:

R(T) = Σⱼ (yⱼ − pⱼ)²

where yⱼ is the actual value from the data set and pⱼ is the corresponding predicted value.

On the training data set, an overfit model gives a comparatively small value of R, whereas on the test data set the value is much higher. The value of R by itself is therefore not a very good way to decide whether the model is performing well, so we define a new function:

R_α(T) = R(T) + α·|T|

Here |T| refers to the number of terminal nodes (leaves) in the tree T. Thus, the greater the number of leaves, the higher the penalty. This function overcomes the shortcoming of the plain SSR by penalizing trees with a greater number of terminal nodes.

To ascertain the value of α, we first build a Decision Tree on the entire data (training and testing), then consider different values of α, each of which corresponds to a different subtree of this full tree. For example, if α is zero we keep the complete tree, since R_α = R in that case. As α increases, smaller subtrees, obtained by pruning leaves of the full tree, become the ones with the least R_α value for the given α.

Now that we have a list of possible values of α, we’ll select the best-suited value by fitting the model to our original training and testing dataset for all values of α and selecting the one with the maximum accuracy score.

For the Iris dataset, let’s work out pre-pruning and post-pruning with some code.

Pre-pruning:
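The original snippet is not reproduced here; a minimal pre-pruning sketch on the Iris dataset could be:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# An unrestricted tree (prone to overfitting) versus a pre-pruned tree
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(max_depth=3, min_samples_split=5,
                                     random_state=42).fit(X_train, y_train)

print("Full tree  -> train:", full_tree.score(X_train, y_train),
      " test:", full_tree.score(X_test, y_test))
print("Pre-pruned -> train:", pruned_tree.score(X_train, y_train),
      " test:", pruned_tree.score(X_test, y_test))
```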

Post-pruning:
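The original snippet is likewise not reproduced; as a sketch, scikit-learn exposes cost complexity pruning through cost_complexity_pruning_path and the ccp_alpha parameter:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Effective alphas at which successive subtrees are pruned away
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

# Fit one tree per candidate alpha and keep the one with the best test accuracy
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    score = tree.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score

print("Best alpha:", best_alpha, "with test accuracy:", best_score)
```

This mirrors the article’s simpler approach of picking α by test accuracy; a more careful model selection would choose it via cross-validation.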

‘α’ is selected in the range [0.01, 0.02] since a maximum for test accuracy is obtained in this range

From the above execution, it can be seen that the model performs much better after pruning the original decision tree.

Advantages

1. The merit of the Decision Tree over other algorithms lies in how simple it is to represent visually and how easy it is to comprehend the binary choices made at each node, which require no assumptions about the distribution of the data. This makes it a very intuitive algorithm even for a muggle (we’ll call them that for now).

2. What makes it incredibly powerful is its ability to work with boolean as well as numerical targets, and with both discrete and continuous features, in the form of classification trees and regression trees.

3. Decision Trees do not demand much effort in data preparation: unlike linear regression, logistic regression, and Neural Networks, they need no normalization or feature scaling during pre-processing.

Disadvantages

1. Decision Trees are at constant risk of overfitting the sample data (which we explored above).

2. They are extremely sensitive to small changes in the data: a slightly different training set can produce a very different tree.

3. They are comparatively much slower when it comes to regression and in practice are applied mostly to classification tasks.

4. A Decision Tree is a greedy algorithm: it always selects the current best split, the one that maximizes the information gain, and, as with most greedy algorithms, it does not backtrack to revise a previous split. This makes it vulnerable to reaching a tree that is not the overall optimal choice (analogous to a gradient-based optimizer getting stuck in a local minimum rather than the global minimum).

CONCLUSION

Now give yourself a pat on the back for completing the article 😃.

Through some real-life examples, we introduced you to the basic idea of Decision Trees. We talked about the basic structural components of a decision tree, its advantages, and its disadvantages. We learned about the application of Decision Trees in regression and classification by grasping the underlying concepts of entropy, Gini, and variance through mathematical expressions and the intuition behind them. Finally, we dealt with the case of overfitting in decision trees and understood the methods of pre-pruning and post-pruning to deal with it.

So, revise, code, and watch out for our next article!

Follow us on:

  1. Facebook
  2. Instagram
  3. Linkedin
  4. YouTube
