Decision Tree

 

A decision tree is a simple algorithm: it recursively splits the data, at each node considering all possible features. A deeper tree has higher variance and a shallower tree has higher bias. Naturally, we want to find the smallest tree that fits all the data (we call a tree with zero training error consistent). In general, however, this problem is NP-hard. A decision tree by itself is a rather weak algorithm, but combined with bagging or boosting it becomes very powerful.
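For concreteness, here is a minimal sketch of the usual greedy procedure (my own illustration, not any particular library's implementation): try every feature and every observed cut-point, keep the split whose children have the lowest weighted impurity, and recurse until the node is pure or a depth limit is reached. The misclassification-based impurity and the dictionary node representation are simplifications of my own; the more common impurity measures are discussed in the next section.

```python
from collections import Counter

def misclass_impurity(labels):
    """Fraction of samples outside the majority class (a crude impurity;
    Gini and entropy, discussed below, are the more common choices)."""
    return 1.0 - Counter(labels).most_common(1)[0][1] / len(labels)

def build_tree(X, y, depth=0, max_depth=3):
    """Greedily try every (feature, cut-point) split and keep the best one."""
    if len(set(y)) == 1 or depth == max_depth:            # pure node or depth limit
        return {"leaf": Counter(y).most_common(1)[0][0]}
    n, best = len(y), None
    for d in range(len(X[0])):                            # every feature ...
        for t in sorted({row[d] for row in X}):           # ... and every observed cut-point
            left = [i for i in range(n) if X[i][d] <= t]
            right = [i for i in range(n) if X[i][d] > t]
            if not left or not right:
                continue
            score = (len(left) * misclass_impurity([y[i] for i in left]) +
                     len(right) * misclass_impurity([y[i] for i in right])) / n
            if best is None or score < best[0]:
                best = (score, d, t, left, right)
    if best is None:                                      # no split separates the points
        return {"leaf": Counter(y).most_common(1)[0][0]}
    _, d, t, left, right = best
    return {"feature": d, "cut": t,
            "left":  build_tree([X[i] for i in left],  [y[i] for i in left],  depth + 1, max_depth),
            "right": build_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth)}

# tiny example: the tree splits on the first feature, separating the a's from the b's
print(build_tree([[1.0, 0.0], [1.2, 1.0], [2.0, 0.0], [2.2, 1.0]], ["a", "a", "b", "b"]))
```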

One great advantage of decision trees is that no pre-processing is generally needed for the decision tree (and hence also the random forest) algorithm: each dimension can have very different scales and variances.

Impurity

To decide how to make a split, we want to minimize the “impurity” after each split. One very common measure is the Gini impurity; for a set S with K classes, it is given by

G(S)=\sum_{k=1}^K \frac{|S_k|}{|S|} \left(1-\frac{|S_k|}{|S|} \right),

where S_k is the set of samples belonging to class k.
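As a sanity check, here is a direct translation of G(S) into a few lines of Python (a minimal sketch; the function name and the use of a plain label list are my own choices):

```python
from collections import Counter

def gini_impurity(labels):
    """G(S) = sum_k p_k * (1 - p_k), where p_k = |S_k| / |S|."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

# a pure set has impurity 0; a 50/50 mix of two classes has impurity 0.5
print(gini_impurity(["a", "a", "a"]))       # 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5
```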

Another measure of impurity comes from maximizing the KL divergence from the uniform distribution. Denoting p_k = \frac{|S_k|}{|S|}, we have

\max KL(p\|\frac{1}{K}) = \sum_{k=1}^K p_k \log \frac{p_k}{1/K}

Since \log K is a constant, this is equivalent to

\max \sum_{k=1}^K p_k \log p_k,

which is the same as minimizing the entropy of S. Note that it is easy to verify that minimizing the weighted sum of the entropies of the split sets is the same as minimizing the conditional entropy of the class label given the split.
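The entropy impurity and the weighted sum of the child entropies can be written in the same style (again a minimal sketch; log base 2 is an arbitrary choice and only rescales the values):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_k p_k log p_k, with p_k = |S_k| / |S|."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(left_labels, right_labels):
    """Weighted sum of the child entropies, i.e. the conditional entropy of the class given the split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * entropy(left_labels) + \
           (len(right_labels) / n) * entropy(right_labels)

# the split minimizing this quantity maximizes the information gain H(S) - H(class | split)
print(split_entropy(["a", "a"], ["b", "b"]))  # 0.0: a perfect split
print(split_entropy(["a", "b"], ["a", "b"]))  # 1.0: the split tells us nothing
```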

Complexity

In practice, we can simply try every possible split along each dimension during training. The complexity of finding the best split at a node is then O(N D K), where N is the number of data points, D is the number of dimensions, and K is the number of classes (each entropy evaluation takes at least O(K)). Note that if we recomputed the entropy from scratch for each candidate split, each evaluation would cost O(K N) instead, because we would have to go through all the data to compute the p_k. However, if we sweep the candidate splits of a dimension in sorted order, each step moves only one data point across the split, so updating the p_k is O(1) instead of O(N).
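To make the O(1) update concrete, here is a sketch of scanning the candidate cut-points of a single feature in sorted order while maintaining the left/right class counts incrementally (my own illustration; the midpoint thresholds and function names are my choices):

```python
from collections import Counter
from math import log2

def entropy_from_counts(counts, m):
    """Entropy of a node, computed from its class counts in O(K) time."""
    return sum(-(c / m) * log2(c / m) for c in counts.values() if c) if m else 0.0

def best_cut_point(values, labels):
    """Scan all cut-points of one feature in sorted order; each step moves a single
    point across the cut, so the class counts are updated in O(1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    left, right = Counter(), Counter(labels)          # class counts on each side of the cut
    n = len(labels)
    best_t, best_score = None, float("inf")
    for pos, i in enumerate(order[:-1]):
        left[labels[i]] += 1                          # O(1) update instead of an
        right[labels[i]] -= 1                         # O(N) recount of the p_k
        if values[i] == values[order[pos + 1]]:
            continue                                  # cannot cut between equal values
        n_left, n_right = pos + 1, n - pos - 1
        score = (n_left * entropy_from_counts(left, n_left) +
                 n_right * entropy_from_counts(right, n_right)) / n
        if score < best_score:
            best_t = (values[i] + values[order[pos + 1]]) / 2
            best_score = score
    return best_t, best_score

print(best_cut_point([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))  # (2.5, 0.0)
```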

Decision tree regression

One can readily apply the decision tree algorithm to regression problems. The only difference is that, rather than using the Gini or entropy impurity, we use the variance of the split set as the impurity. Unlike the discrete case, though, we can generally never get a consistent tree (one that exactly fits the training targets), so we may terminate further splitting when, for example, a pre-set depth is reached.
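A minimal sketch of scoring a regression split under this criterion (my own illustration; "variance" here is the mean squared deviation within each child, weighted by child size):

```python
def variance(targets):
    """Mean squared deviation from the mean, used as the impurity of a regression node."""
    m = sum(targets) / len(targets)
    return sum((t - m) ** 2 for t in targets) / len(targets)

def regression_split_score(left_targets, right_targets):
    """Weighted variance of the two children; the best cut-point minimizes this."""
    n = len(left_targets) + len(right_targets)
    return (len(left_targets) * variance(left_targets) +
            len(right_targets) * variance(right_targets)) / n

# splitting [1, 1.1, 5, 5.2] between the small and large values is clearly better
print(regression_split_score([1.0, 1.1], [5.0, 5.2]))  # ~0.006
print(regression_split_score([1.0, 5.0], [1.1, 5.2]))  # ~4.1
```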

 

Update:

For the discussion above, the cut-point (the threshold at which a feature is split) is chosen by optimizing the Gini impurity or the information gain. However, one may also select it randomly; the resulting algorithm is known as Extra Trees (extremely randomized trees) rather than a random forest. One may want to look into the differences between the two.
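For a rough picture of the difference, here is a sketch of an Extra-Trees-style split (my paraphrase of the extremely randomized trees idea: draw a single random cut-point per candidate feature and keep the best-scoring of those random cuts, instead of searching every cut-point as above). The function names and the fixed random seed are my own choices.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return (1.0 - sum((c / n) ** 2 for c in Counter(labels).values())) if n else 0.0

def extra_tree_split(X, y, rng=None):
    """Draw one random cut-point per feature and keep the best of these random cuts."""
    rng = rng or random.Random(0)
    best = None
    for d in range(len(X[0])):
        lo, hi = min(x[d] for x in X), max(x[d] for x in X)
        if lo == hi:
            continue                                  # constant feature, nothing to cut
        t = rng.uniform(lo, hi)                       # random cut-point, no search
        left = [y[i] for i, x in enumerate(X) if x[d] <= t]
        right = [y[i] for i, x in enumerate(X) if x[d] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, d, t)
    return best                                       # (weighted Gini, feature index, cut-point)

# picks a feature and a random threshold; the exact cut depends on the seed
print(extra_tree_split([[1.0, 0.3], [1.2, 0.9], [2.0, 0.1], [2.2, 0.7]], ["a", "a", "b", "b"]))
```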
