While bagging mitigates the variance problem, boosting reduces bias by combining weak learners.
Boosting can be viewed as gradient descent in the functional space. With $H$ as the current classifier and a weak learner $h$ added with step size $\alpha > 0$, by Taylor expansion we have

$$\ell(H + \alpha h) \approx \ell(H) + \alpha \sum_{i=1}^n \frac{\partial \ell}{\partial H(x_i)}\, h(x_i).$$

So out of an ensemble of weak classifiers $\mathbb{H}$, we want to find the best $h$ as follows:

$$h = \operatorname*{argmin}_{h \in \mathbb{H}} \sum_{i=1}^n \frac{\partial \ell}{\partial H(x_i)}\, h(x_i).$$

Denote $r_i = \frac{\partial \ell}{\partial H(x_i)}$, so we are looking for the $h$ most aligned with the negative functional gradient $-r$.
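A minimal sketch of this generic loop, assuming hypothetical helpers `loss_grad` (returns $r_i = \partial\ell/\partial H(x_i)$) and `fit_weak_learner` (fits any weak-learner class to a target):

```python
import numpy as np

def boost(X, y, loss_grad, fit_weak_learner, alpha=0.1, T=100):
    """Boosting as gradient descent in function space (sketch)."""
    learners = []                      # the ensemble h_1, ..., h_T
    H_x = np.zeros(len(y))             # current predictions H(x_i), starting from H = 0
    for _ in range(T):
        r = loss_grad(H_x, y)          # functional gradient r_i = dL/dH(x_i)
        h = fit_weak_learner(X, -r)    # pick h most aligned with the negative gradient
        H_x += alpha * h.predict(X)    # step in function space: H <- H + alpha * h
        learners.append(h)
    return learners
```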
Gradient Boosted Regression Tree (GBRT)
Fixed-depth (say depth $= 4$) regression trees are used as weak learners. The step size $\alpha$ is fixed as a hyper-parameter. And we can always normalize the weak learner so that $\sum_{i=1}^n h(x_i)^2$ is a constant. Then

$$\operatorname*{argmin}_{h} \sum_{i=1}^n r_i\, h(x_i) = \operatorname*{argmin}_{h} \sum_{i=1}^n \big(h(x_i) - (-r_i)\big)^2.$$

So we can train just a regular regression tree (with input $x_i$ and label $-r_i$). And if the overall square loss $\ell(H) = \frac{1}{2}\sum_{i=1}^n (H(x_i) - y_i)^2$ is used, $-r_i = y_i - H(x_i)$ is just the residual.
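A minimal GBRT sketch under square loss, where the tree is fit to the residual $y_i - H(x_i)$; the depth-4 trees, step size, and number of rounds are illustrative choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_fit(X, y, T=100, alpha=0.1, depth=4):
    """Gradient boosted regression trees with square loss (sketch)."""
    trees = []
    H_x = np.zeros(len(y))                        # start from H = 0
    for _ in range(T):
        residual = y - H_x                        # -r_i for the square loss
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residual)
        H_x += alpha * tree.predict(X)            # H <- H + alpha * h
        trees.append(tree)
    return trees

def gbrt_predict(trees, X, alpha=0.1):
    return alpha * sum(t.predict(X) for t in trees)
```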
AdaBoost (Adaptive Boosting)
Although it was the first boosting algorithm, it has some very nice properties (the training loss decreases exponentially). People did not realize when it was introduced that it is the same as gradient descent in the functional space. The step size $\alpha$ is adapted to the data, hence the name. It can only handle binary labels $y_i \in \{+1, -1\}$ and so cannot be used for regression.
The exponential loss function $\ell(H) = \sum_{i=1}^n e^{-y_i H(x_i)}$ is used. Then $r_i = \frac{\partial \ell}{\partial H(x_i)} = -y_i e^{-y_i H(x_i)}$. Define weights $w_i = \frac{1}{Z} e^{-y_i H(x_i)}$, where $Z = \sum_{j=1}^n e^{-y_j H(x_j)}$ and is just the partition function. Then,

$$\sum_{i=1}^n r_i\, h(x_i) = -Z \sum_{i=1}^n w_i\, y_i\, h(x_i) = -Z\Big(\sum_{i:\, h(x_i) = y_i} w_i - \sum_{i:\, h(x_i) \neq y_i} w_i\Big) = Z\Big(2\sum_{i:\, h(x_i) \neq y_i} w_i - 1\Big),$$

since $y_i h(x_i) \in \{+1, -1\}$ and $\sum_i w_i = 1$. So finding the best $h$ amounts to minimizing the weighted classification error.
Optimal stepsize
Denote the weighted error $\sum_{i:\, h(x_i) \neq y_i} w_i$ as $\epsilon$. One can optimize the step size as $\alpha = \operatorname*{argmin}_{\alpha} \ell(H + \alpha h) = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$.
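The closed form follows by setting the derivative of the exponential loss with respect to $\alpha$ to zero and splitting the sum into correctly and incorrectly classified points, using the definitions above:

```latex
\begin{align*}
\frac{\partial \ell(H + \alpha h)}{\partial \alpha}
  &= \sum_{i=1}^n -y_i h(x_i)\, e^{-y_i\left(H(x_i) + \alpha h(x_i)\right)} \\
  &= -e^{-\alpha} \sum_{i:\, h(x_i) = y_i} e^{-y_i H(x_i)}
     + e^{\alpha} \sum_{i:\, h(x_i) \neq y_i} e^{-y_i H(x_i)} \\
  &= Z\left( -(1-\epsilon)\, e^{-\alpha} + \epsilon\, e^{\alpha} \right) = 0
  \quad\Longrightarrow\quad
  \alpha = \tfrac{1}{2} \ln \frac{1-\epsilon}{\epsilon}.
\end{align*}
```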
Renormalization
We need to renormalize $w_i$ and $Z$ after the update $H \leftarrow H + \alpha h$. Denote the unnormalized weight as $\hat{w}_i = w_i\, e^{-\alpha y_i h(x_i)}$. Note that this step essentially boosts all incorrect samples ($y_i h(x_i) = -1$) by $e^{\alpha} = \sqrt{\frac{1-\epsilon}{\epsilon}}$ and scales down all correct samples ($y_i h(x_i) = +1$) by $e^{-\alpha} = \sqrt{\frac{\epsilon}{1-\epsilon}}$. And

$$\sum_{i=1}^n \hat{w}_i = (1-\epsilon)\, e^{-\alpha} + \epsilon\, e^{\alpha} = 2\sqrt{\epsilon(1-\epsilon)},$$

so the partition function (i.e., the exponential training loss) becomes $\hat{Z} = 2\sqrt{\epsilon(1-\epsilon)}\, Z < Z$ whenever $\epsilon < \frac{1}{2}$, which is why the training loss decreases exponentially.
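Putting the pieces together, a minimal AdaBoost sketch with the weight update and renormalization above; using depth-1 trees (decision stumps) as weak learners is an assumption, and labels are taken to be in $\{-1, +1\}$:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost with decision stumps (sketch); y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # normalized weights, sum to 1
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum()                 # weighted error epsilon
        if eps == 0 or eps >= 0.5:               # no useful weak learner left
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # optimal step size
        w *= np.exp(-alpha * y * pred)           # boost mistakes, shrink correct samples
        w /= w.sum()                             # renormalize by 2*sqrt(eps*(1-eps))
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    H = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(H)
```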
Reference
Boosting lecture note from Cornell