Decision Trees
The concept of Decision Trees appeared in the 1960s in the field of psychology as a way to model human learning, visually presenting the possible outcomes of a situation that stem from a single prompt. Around the same time, their usefulness in programming and mathematics was recognized; the first paper to develop the concept of a “Decision Tree” mathematically was published by William Belson in 1959. In 1977, professors from Berkeley and Stanford began developing the algorithm known as Classification and Regression Trees (CART), which, true to its name, consists of Classification Trees and Regression Trees. To this day, CART remains an important algorithm for data analysis, and in the field of ML, Decision Trees are among the most popular algorithms for real-world data science problems.
Decision Trees serve as a tool that models the outcomes of observations in a visually interpretable format. A tree consists of nodes and branches, which stem from a root. Consider the following example of a series of choices about a coin. The first question asked, “What color is the coin?”, is called the root of the tree. Each possible outcome of the question is represented by a node that branches out from the root, much like flipping a coin produces one of two possible outcomes, heads or tails. Following the outcome “Silver,” we can continue the splitting process by asking another question: “What size is the coin?” Questions that are not the root are considered parent nodes. Once we’ve asked every possible question and created a node for each possible outcome, the final nodes are called leaf nodes:
In a sense, Decision Trees can store and organize data based on its attributes. The ML concept of Decision Trees uses this intuition to categorize or classify data based on patterns in its features. By splitting the samples according to their associated features, Decision Trees can be trained to learn different patterns in the data. Using this method, Decision Trees can handle high-dimensional data with ease, as their approach differs greatly from KNN and regression methods. One of the major advantages of Decision Trees over regression methods is their interpretability, as seen previously: it would be hard to visualize the gradient descent process on high-dimensional data, but a Decision Tree’s sequence of splits remains easy to follow.
With features and targets, Decision Trees seek the best “questions” to ask, in the best order, to split the data into groups that resemble those assigned by the targets. Take the Titanic dataset for example. It contains features about travelers on the Titanic such as their age, sex, cabin, fare, and so on. The aim is to predict whether someone survived based on the attributes given. A Decision Tree would split the data as follows; the process is demonstrated in the following figure and in the code sketch after these steps:
- Choose a feature that best splits the data.
- For that feature, choose a threshold that best splits the data by its labels. A threshold for a categorical feature could be true/false; if the feature contains more than two categories, it can be converted to numbers representing each class and treated as a continuous feature.
- For each node that the split produces, repeat the preceding steps until the data is perfectly split, meaning that each node contains exactly one class, or until the maximum depth defined by the user is reached.
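To make these steps concrete, here is a minimal sketch of fitting a Decision Tree to the Titanic data with scikit-learn’s DecisionTreeClassifier. It assumes seaborn’s bundled copy of the dataset and an illustrative choice of columns, which may differ from the copy of the data used in this chapter:

```python
# Minimal sketch: fit a Decision Tree to the Titanic data.
# Assumes seaborn's bundled copy of the dataset; the chosen columns
# are illustrative, not the full feature set.
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the data and keep a few illustrative features
titanic = sns.load_dataset("titanic")
titanic = titanic[["survived", "pclass", "sex", "age", "fare"]].dropna()

# Categorical features are converted to numbers before splitting
X = titanic[["pclass", "sex", "age", "fare"]].copy()
X["sex"] = (X["sex"] == "female").astype(int)
y = titanic["survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_depth caps how many levels of splits the tree may make
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```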
Simply stating “choose the best feature to split the data” isn’t enough. Decision Trees use metrics that measure the “impurity” of the data, which determines how well the data is split. If the samples in a node contain only one class, the node is pure; if they contain half of one class and half of the other, the node is maximally impure; anything in between is partially impure. The two common metrics used to determine impurity in Decision Trees are Gini Impurity and Entropy.
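Both measures have simple closed forms: for the class proportions p_k within a node, Gini Impurity is 1 - sum(p_k^2) and Entropy is -sum(p_k * log2(p_k)). The following sketch computes both (the function names are mine, not from a library):

```python
# Sketch of the two impurity measures; function names are illustrative.
import numpy as np

def gini_impurity(labels):
    # Proportion of each class in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A pure node scores 0 for both measures;
# a 50/50 node gives Gini 0.5 and Entropy 1.0
print(gini_impurity([1, 1, 1, 1]), entropy([1, 1, 1, 1]))
print(gini_impurity([0, 1, 0, 1]), entropy([0, 1, 0, 1]))
```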
Computationally, however, Entropy is more complex because it requires logarithms, so Gini Impurity is usually the metric a Decision Tree uses to determine splits. When splitting a node, we treat each unique value of a feature as a candidate threshold, loop through all the thresholds, and calculate the respective Gini Impurity. We repeat this process for every feature and then pick the threshold with the best Gini Impurity, splitting the data accordingly, as sketched below. As mentioned before, the splitting continues until each node is pure or a user-specified “depth” hyperparameter is reached.
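Here is a rough sketch of that exhaustive search, assuming numeric features and reusing the gini_impurity helper from the previous snippet; weighting each child node’s impurity by its sample count is my assumption about how the comparison is made:

```python
# Sketch of the exhaustive split search described above.
# Reuses the gini_impurity helper defined earlier.
import numpy as np

def best_split(X, y):
    """Return the (feature index, threshold) with the lowest weighted Gini."""
    best = (None, None, float("inf"))
    n_samples, n_features = X.shape
    for feature in range(n_features):
        # Each unique value of the feature is treated as a candidate threshold
        for threshold in np.unique(X[:, feature]):
            left = y[X[:, feature] <= threshold]
            right = y[X[:, feature] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue  # skip splits that leave one side empty
            # Weight each child's impurity by its share of the samples
            weighted = (len(left) * gini_impurity(left)
                        + len(right) * gini_impurity(right)) / n_samples
            if weighted < best[2]:
                best = (feature, threshold, weighted)
    return best[0], best[1]
```

An actual Decision Tree would apply this search recursively to each child node until the nodes are pure or the depth limit is reached.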
The depth hyperparameter decides how many “levels” of nodes there should be in a Decision Tree. In principle, the deeper a Decision Tree is, the higher its performance, as the splits get more specific. In practice, though, the improvement usually applies only to the training data: a very deep tree overfits easily, leading to worse performance on unseen testing data.
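One way to see this trade-off is to sweep the depth limit and compare training and test accuracy, reusing the Titanic split from the earlier sketch (an illustration, not a benchmark):

```python
# Sketch: deeper trees keep improving on the training data,
# but test accuracy typically peaks at a modest depth.
# Reuses X_train, X_test, y_train, y_test from the earlier Titanic snippet.
from sklearn.tree import DecisionTreeClassifier

for depth in [1, 2, 3, 5, 10, None]:   # None lets the tree grow until pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```

Typically, training accuracy keeps climbing as the depth grows, while test accuracy levels off or drops once the tree begins to overfit.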