Boosting

While bagging mitigates the variance problem, boosting reduces bias by combining weak learners.

Boosting can be viewed as gradient descent in the functional space. Take H as the current classifier; by a first-order Taylor expansion, we have

l(H+\alpha h) \approx l(H) + \alpha \langle \nabla l (H) , h \rangle.

So, out of a set \mathcal H of weak classifiers, we want to find the best h \in \mathcal H as follows:

\arg\min_{h \in {\mathcal H}} \langle \nabla l (H) , h \rangle

= \arg\min_{h \in {\mathcal H}} \sum_{i=1}^n \frac{\partial l}{\partial H({\bf x}_i)} h({\bf x}_i)

Denote t_i = - \frac{\partial l}{\partial H({\bf x}_i)}; we are then looking for the h \in \mathcal H most aligned with {\bf t} = [t_1,\cdots,t_n].
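To make the recipe concrete, here is a minimal sketch of the generic boosting loop in Python (not part of the original note); grad_loss and fit_weak_learner are hypothetical placeholders for the functional gradient and the weak-learner training routine.

# A minimal sketch of boosting as functional gradient descent.
# `grad_loss(H)` and `fit_weak_learner(X, t)` are hypothetical placeholders:
# the former returns [dl/dH(x_i)]_i, the latter fits one weak learner to targets t.
def boost(X, grad_loss, fit_weak_learner, T, alpha):
    H = [0.0] * len(X)                     # current predictions H(x_i) on the training set
    ensemble = []                          # H(x) = sum of alpha * h(x) over the ensemble
    for _ in range(T):
        t = [-g for g in grad_loss(H)]     # t_i = -dl/dH(x_i), the negative functional gradient
        h = fit_weak_learner(X, t)         # weak learner most aligned with t
        ensemble.append((alpha, h))
        H = [H_i + alpha * h(x_i) for H_i, x_i in zip(H, X)]   # H <- H + alpha h
    return ensemble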

Gradient Boosted Regression Tree (GBRT)

Fixed-depth regression trees (say, depth 4) are used as weak learners. The step size \alpha is fixed as a hyper-parameter, and we can always normalize \sum_{i=1}^n h({\bf x}_i)^2 to a constant. Then

\arg\min_{h \in {\mathcal H}} \sum_{i=1}^n -t_i h({\bf x}_i) = \arg\min_{h \in {\mathcal H}} \sum_{i=1}^n \left( t_i^2 - 2 t_i h({\bf x}_i) + h({\bf x}_i)^2 \right) = \arg\min_{h \in {\mathcal H}} \sum_{i=1}^n ( h({\bf x}_i) - t_i)^2,

where the first equality holds because \sum_i t_i^2 and \sum_i h({\bf x}_i)^2 are constants with respect to the choice of h, and scaling the objective by 2 does not change the minimizer.

So we can train h(\cdot) as a regular regression tree (with inputs {\bf x}_i and labels t_i). And if the overall squared loss l(H) = \frac{1}{2} \sum_{i=1}^n (H({\bf x}_i)-y_i)^2 is used, then t_i = -\frac{\partial l}{\partial H({\bf x}_i)} = y_i - H({\bf x}_i), i.e., the residual.
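For example, here is a minimal GBRT sketch under the squared loss, assuming scikit-learn's DecisionTreeRegressor as the fixed-depth weak learner (the function names and hyper-parameter defaults below are illustrative, not from the lecture note). Each round fits a depth-4 tree to the current residuals y_i - H({\bf x}_i) and adds it with step size \alpha.

# A minimal GBRT sketch: fit each round's tree to the residuals (negative gradient
# of the squared loss) and add it to the ensemble with a fixed step size alpha.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_fit(X, y, T=100, alpha=0.1, depth=4):
    H = np.zeros(len(y))                       # current ensemble predictions H(x_i)
    trees = []
    for _ in range(T):
        t = y - H                              # t_i = y_i - H(x_i), the residuals
        h = DecisionTreeRegressor(max_depth=depth).fit(X, t)
        H += alpha * h.predict(X)              # H <- H + alpha * h
        trees.append(h)
    return trees

def gbrt_predict(trees, X, alpha=0.1):
    # alpha must match the step size used in gbrt_fit
    return alpha * sum(h.predict(X) for h in trees)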

AdaBoost (Adaptive Boosting)

Although it is the first boosting algorithm, it has some very nice properties (the training loss decreases exponentially). When it was introduced, people did not yet realize that it is the same as gradient descent in the functional space. The step size \alpha is adapted to the data, hence the name. It can only handle binary labels, so it cannot be used for regression.

The exponential loss function l(H) = \sum_{i=1}^n e^{-y_i H({\bf x}_i)} is used. Then \frac{\partial l}{\partial H({\bf x}_i)} = -y_i e^{-y_i H({\bf x}_i)} = -y_i w_i Z, where w_i = \frac{1}{Z} e^{-y_i H({\bf x}_i)} and Z = \sum_{i=1}^n e^{-y_i H({\bf x}_i)} = l(H) is just the partition function. Then,

h^* = \arg\min_{h \in \mathcal H} - \sum_{i=1}^n w_i y_i Z h({\bf x}_i) = \arg\min_{h \in \mathcal H} - \sum_{i=1}^n w_i y_i h({\bf x}_i)

= \arg\min_{h \in \mathcal H} \sum_{i: y_i \neq h({\bf x}_i)} w_i - \sum_{i: y_i = h({\bf x}_i)} w_i

= \arg\min_{h \in \mathcal H} \sum_{i: y_i \neq h({\bf x}_i)} w_i,

since y_i h({\bf x}_i) = +1 when h is correct on {\bf x}_i and -1 otherwise, and \sum_i w_i = 1, so the objective equals 2 \sum_{i: y_i \neq h({\bf x}_i)} w_i - 1. In other words, the best weak learner is the one with the smallest weighted training error.

Optimal step size

Denote the weighted training error \sum_{i:h^*({\bf x}_i) \neq y_i} w_i of h^* as \epsilon. One can then optimize the step size, which gives \alpha = \frac{1}{2} \ln \frac{1-\epsilon}{\epsilon}.
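To see where this comes from, minimize the updated loss over \alpha, splitting the sum into correct and incorrect samples (the same split used in the renormalization below):

l(H+\alpha h^*) = \sum_{i=1}^n e^{-y_i (H({\bf x}_i)+\alpha h^*({\bf x}_i))} = Z\left(\epsilon e^{\alpha} + (1-\epsilon) e^{-\alpha}\right),

and setting the derivative \epsilon e^{\alpha} - (1-\epsilon) e^{-\alpha} to zero gives e^{2\alpha} = \frac{1-\epsilon}{\epsilon}, i.e., \alpha = \frac{1}{2} \ln \frac{1-\epsilon}{\epsilon}.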

Renormalization 

We need to renormalize w_i and Z after each update. Denote the unnormalized weight e^{-y_i H({\bf x}_i)} as \hat w_i. Then \hat w_i \leftarrow \hat w_i e^{-\alpha y_i h^*({\bf x}_i)} as H({\bf x}_i) \leftarrow H({\bf x}_i) + \alpha h^*({\bf x}_i). Note that this step essentially boosts all incorrect samples (h^*({\bf x}_i) \neq y_i) by e^{\alpha} and scales down all correct samples (h^*({\bf x}_i) = y_i) by e^{-\alpha}. And

Z \leftarrow \sum_{i=1}^n e^{-y_i (H({\bf x}_i)+\alpha h^*({\bf x}_i))} = Z \sum_{i=1}^n w_i e^{-\alpha y_i h^*({\bf x}_i)} = Z\left(\sum_{i:h^*({\bf x}_i)\neq y_i} w_i e^{\alpha} + \sum_{i:h^*({\bf x}_i)= y_i} w_i e^{-\alpha} \right) = Z \left(e^{\alpha} \epsilon + e^{-\alpha} (1-\epsilon)\right) = Z\left(\epsilon \sqrt{\frac{1-\epsilon}{\epsilon}}+(1-\epsilon)\sqrt{\frac{\epsilon}{1-\epsilon}}\right) = Z \cdot 2\sqrt{\epsilon(1-\epsilon)}.

Since 2\sqrt{\epsilon(1-\epsilon)} < 1 whenever \epsilon < \frac{1}{2}, i.e., whenever the weak learner does better than random guessing, the training loss Z = l(H) shrinks multiplicatively in every round, which is the exponential decrease mentioned above.
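Putting the pieces together, here is a minimal AdaBoost sketch (an illustrative implementation, not the canonical one): depth-1 scikit-learn trees serve as weak learners, trained on the weighted data via sample_weight, and labels are assumed to be in \{-1,+1\}.

# A minimal AdaBoost sketch: decision stumps as weak learners on weighted data,
# optimal step size per round, and weight renormalization. Labels must be in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # normalized weights w_i, summing to 1
    ensemble = []                                 # list of (alpha, h) pairs
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != y].sum()                  # weighted training error of h*
        eps = np.clip(eps, 1e-10, 1 - 1e-10)      # avoid division by zero / log of 0
        alpha = 0.5 * np.log((1 - eps) / eps)     # optimal step size
        w = w * np.exp(-alpha * y * pred)         # boost mistakes by e^alpha, shrink correct by e^-alpha
        w = w / w.sum()                           # renormalize; the sum equals 2*sqrt(eps*(1-eps))
        ensemble.append((alpha, h))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(alpha * h.predict(X) for alpha, h in ensemble)
    return np.sign(score)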

 

Reference

Boosting lecture note from Cornell
