Boosting

While bagging mitigates the variance problem, boosting reduces bias by combining weak learners.

Boosting can be viewed as gradient descent in the functional space. Take H as the current classifier; by a first-order Taylor expansion, we have

l(H+\alpha h) \approx l(H) + \alpha \langle \nabla l (H) , h \rangle.

So, out of a set \mathcal H of weak classifiers, we want to find the best h \in \mathcal H as follows:

\arg\min_{h \in {\mathcal H}} \langle \nabla l (H) , h \rangle

= \arg\min_{h \in {\mathcal H}} \sum_{i=1}^n \frac{\partial l}{\partial H({\bf x}_i)} h({\bf x}_i)

Denote t_i = - \frac{\partial l}{\partial H({\bf x}_i)}; we are then looking for the h \in \mathcal H most aligned with {\bf t} = [t_1,\cdots,t_n].
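To make the recipe concrete, here is a minimal sketch of the generic boosting loop in Python (not part of the original note); grad_loss and fit_weak_learner are hypothetical placeholders for the functional gradient and the weak-learner training routine.

# A minimal sketch of boosting as functional gradient descent.
# `grad_loss(H)` and `fit_weak_learner(X, t)` are hypothetical placeholders:
# the former returns [dl/dH(x_i)]_i, the latter fits one weak learner to targets t.
def boost(X, grad_loss, fit_weak_learner, T, alpha):
    H = [0.0] * len(X)                     # current predictions H(x_i) on the training set
    ensemble = []                          # H(x) = sum of alpha * h(x) over the ensemble
    for _ in range(T):
        t = [-g for g in grad_loss(H)]     # t_i = -dl/dH(x_i), the negative functional gradient
        h = fit_weak_learner(X, t)         # weak learner most aligned with t
        ensemble.append((alpha, h))
        H = [H_i + alpha * h(x_i) for H_i, x_i in zip(H, X)]   # H <- H + alpha h
    return ensemble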

Gradient Boosted Regression Tree (GBRT)

Fixed-depth regression trees (say, depth 4) are used as weak learners. The step size \alpha is fixed as a hyper-parameter, and we can always normalize \sum_{i=1}^n h({\bf x}_i)^2 to a constant. Then

\arg\min_{h \in {\mathcal H}} \sum_{i=1}^n -t_i h({\bf x}_i) = \arg\min_{h \in {\mathcal H}} \sum_{i=1}^n \left( t_i^2 - 2 t_i h({\bf x}_i) + h({\bf x}_i)^2 \right) = \arg\min_{h \in {\mathcal H}} \sum_{i=1}^n ( h({\bf x}_i) - t_i)^2,

where the first equality holds because \sum_i t_i^2 and \sum_i h({\bf x}_i)^2 are constants with respect to the choice of h, and scaling the objective by 2 does not change the minimizer.

So we can train h(\cdot) as a regular regression tree (with inputs {\bf x}_i and labels t_i). And if the overall squared loss l(H) = \frac{1}{2} \sum_{i=1}^n (H({\bf x}_i)-y_i)^2 is used, then t_i = -\frac{\partial l}{\partial H({\bf x}_i)} = y_i - H({\bf x}_i), i.e., the residual.
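For example, here is a minimal GBRT sketch under the squared loss, assuming scikit-learn's DecisionTreeRegressor as the fixed-depth weak learner (the function names and hyper-parameter defaults below are illustrative, not from the lecture note). Each round fits a depth-4 tree to the current residuals y_i - H({\bf x}_i) and adds it with step size \alpha.

# A minimal GBRT sketch: fit each round's tree to the residuals (negative gradient
# of the squared loss) and add it to the ensemble with a fixed step size alpha.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_fit(X, y, T=100, alpha=0.1, depth=4):
    H = np.zeros(len(y))                       # current ensemble predictions H(x_i)
    trees = []
    for _ in range(T):
        t = y - H                              # t_i = y_i - H(x_i), the residuals
        h = DecisionTreeRegressor(max_depth=depth).fit(X, t)
        H += alpha * h.predict(X)              # H <- H + alpha * h
        trees.append(h)
    return trees

def gbrt_predict(trees, X, alpha=0.1):
    # alpha must match the step size used in gbrt_fit
    return alpha * sum(h.predict(X) for h in trees)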

AdaBoost (Adaptive Boosting)

Although it is the first boosting algorithm, it has some very nice properties (the training loss decreases exponentially). When it was introduced, people did not yet realize that it is the same as gradient descent in the functional space. The step size \alpha is adapted to the data, hence the name. It can only handle binary labels, so it cannot be used for regression.

The exponential loss function l(H) = \sum_{i=1}^n e^{-y_i H({\bf x}_i)} is used. Then \frac{\partial l}{\partial H({\bf x}_i)} = -y_i e^{-y_i H({\bf x}_i)} = -y_i w_i Z, where w_i = \frac{1}{Z} e^{-y_i H({\bf x}_i)} and Z = \sum_{i=1}^n e^{-y_i H({\bf x}_i)} = l(H) is just the partition function. Then,

h^* = \arg\min_{h \in \mathcal H} - \sum_{i=1}^n w_i y_i Z h({\bf x}_i) = \arg\min_{h \in \mathcal H} - \sum_{i=1}^n w_i y_i h({\bf x}_i)

= \arg\min_{h \in \mathcal H} \sum_{i: y_i \neq h({\bf x}_i)} w_i - \sum_{i: y_i = h({\bf x}_i)} w_i

= \arg\min_{h \in \mathcal H} \sum_{i: y_i \neq h({\bf x}_i)} w_i,

since y_i h({\bf x}_i) = +1 when h is correct on {\bf x}_i and -1 otherwise, and \sum_i w_i = 1, so the objective equals 2 \sum_{i: y_i \neq h({\bf x}_i)} w_i - 1. In other words, the best weak learner is the one with the smallest weighted training error.

Optimal step size

Denote the weighted training error \sum_{i:h^*({\bf x}_i) \neq y_i} w_i of h^* as \epsilon. One can then optimize the step size, which gives \alpha = \frac{1}{2} \ln \frac{1-\epsilon}{\epsilon}.
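To see where this comes from, minimize the updated loss over \alpha, splitting the sum into correct and incorrect samples (the same split used in the renormalization below):

l(H+\alpha h^*) = \sum_{i=1}^n e^{-y_i (H({\bf x}_i)+\alpha h^*({\bf x}_i))} = Z\left(\epsilon e^{\alpha} + (1-\epsilon) e^{-\alpha}\right),

and setting the derivative \epsilon e^{\alpha} - (1-\epsilon) e^{-\alpha} to zero gives e^{2\alpha} = \frac{1-\epsilon}{\epsilon}, i.e., \alpha = \frac{1}{2} \ln \frac{1-\epsilon}{\epsilon}.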

Renormalization 

We need to renormalize w_i and Z after each update. Denote the unnormalized weight e^{-y_i H({\bf x}_i)} as \hat w_i. Then \hat w_i \leftarrow \hat w_i e^{-\alpha y_i h^*({\bf x}_i)} as H({\bf x}_i) \leftarrow H({\bf x}_i) + \alpha h^*({\bf x}_i). Note that this step essentially boosts all incorrect samples (h^*({\bf x}_i) \neq y_i) by e^{\alpha} and scales down all correct samples (h^*({\bf x}_i) = y_i) by e^{-\alpha}. And

Z \leftarrow \sum_{i=1}^n e^{-y_i (H({\bf x}_i)+\alpha h^*({\bf x}_i))} = Z \sum_{i=1}^n w_i e^{-\alpha y_i h^*({\bf x}_i)} = Z\left(\sum_{i:h^*({\bf x}_i)\neq y_i} w_i e^{\alpha} + \sum_{i:h^*({\bf x}_i)= y_i} w_i e^{-\alpha} \right) = Z \left(e^{\alpha} \epsilon + e^{-\alpha} (1-\epsilon)\right) = Z\left(\epsilon \sqrt{\frac{1-\epsilon}{\epsilon}}+(1-\epsilon)\sqrt{\frac{\epsilon}{1-\epsilon}}\right) = Z \cdot 2\sqrt{\epsilon(1-\epsilon)}.

Since 2\sqrt{\epsilon(1-\epsilon)} < 1 whenever \epsilon < \frac{1}{2}, i.e., whenever the weak learner does better than random guessing, the training loss Z = l(H) shrinks multiplicatively in every round, which is the exponential decrease mentioned above.
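Putting the pieces together, here is a minimal AdaBoost sketch (an illustrative implementation, not the canonical one): depth-1 scikit-learn trees serve as weak learners, trained on the weighted data via sample_weight, and labels are assumed to be in \{-1,+1\}.

# A minimal AdaBoost sketch: decision stumps as weak learners on weighted data,
# optimal step size per round, and weight renormalization. Labels must be in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # normalized weights w_i, summing to 1
    ensemble = []                                 # list of (alpha, h) pairs
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != y].sum()                  # weighted training error of h*
        eps = np.clip(eps, 1e-10, 1 - 1e-10)      # avoid division by zero / log of 0
        alpha = 0.5 * np.log((1 - eps) / eps)     # optimal step size
        w = w * np.exp(-alpha * y * pred)         # boost mistakes by e^alpha, shrink correct by e^-alpha
        w = w / w.sum()                           # renormalize; the sum equals 2*sqrt(eps*(1-eps))
        ensemble.append((alpha, h))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(alpha * h.predict(X) for alpha, h in ensemble)
    return np.sign(score)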

 

Reference

Boosting lecture note from Cornell
