While bagging mitigates the variance problem, boosting can reduce bias by combining weak learners.
Boosting can be viewed as gradient descent in the functional space. With $H$ as the current classifier, by Taylor expansion, we have

$$\ell(H + \alpha h) \approx \ell(H) + \alpha \sum_{i=1}^n \frac{\partial \ell}{\partial [H(x_i)]}\, h(x_i).$$

So out of a class $\mathbb{H}$ of weak classifiers, we want to find the best $h$ as follows:

$$h_{t+1} = \operatorname*{argmin}_{h \in \mathbb{H}} \sum_{i=1}^n \frac{\partial \ell}{\partial [H(x_i)]}\, h(x_i).$$

Denote $r_i = \frac{\partial \ell}{\partial [H(x_i)]}$, so we are looking for the $h$ most aligned with the negative gradient $-r = (-r_1, \dots, -r_n)$.
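To make the selection rule concrete, here is a tiny Python sketch (the finite candidate pool and the `.predict` interface are illustrative assumptions, not part of the note): it scores each candidate weak learner by $\sum_i r_i h(x_i)$ and returns the one most aligned with $-r$.

```python
import numpy as np

# A minimal sketch (not from the note) of one functional-gradient step: among a
# finite pool of candidate weak learners, pick the one whose predictions are most
# aligned with the negative gradient, i.e. the one minimizing sum_i r_i * h(x_i).
def select_weak_learner(candidates, X, r):
    """r[i] = d loss / d H(x_i); returns argmin over h of sum_i r[i] * h.predict(X)[i]."""
    scores = [float(np.dot(r, h.predict(X))) for h in candidates]
    return candidates[int(np.argmin(scores))]
```

In practice (as in GBRT below) one does not enumerate candidates but trains a weak learner directly against the negative gradient.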
Gradient Boosted Regression Tree (GBRT)
Fixed-depth (say depth = 4) regression trees will be used as weak learners. The step size $\alpha$ is fixed as a hyper-parameter. And we can always normalize $\sum_{i=1}^n h(x_i)^2$ to a constant. Then

$$h_{t+1} = \operatorname*{argmin}_{h} \sum_{i=1}^n r_i\, h(x_i) = \operatorname*{argmin}_{h} \sum_{i=1}^n \big(h(x_i) - (-r_i)\big)^2,$$

since $\sum_i h(x_i)^2$ and $\sum_i r_i^2$ do not depend on the choice of $h$. So we can just train a regular regression tree (with input $x_i$ and label $-r_i$). And if the overall square loss $\ell(H) = \frac{1}{2}\sum_{i=1}^n \big(H(x_i) - y_i\big)^2$ is used, the label is $-r_i = y_i - H(x_i)$, i.e., each tree is fit to the current residuals.
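A minimal Python sketch of this recipe under the square loss (the toy data, step size, and number of rounds are illustrative assumptions): each depth-4 tree is fit to the residuals $y_i - H(x_i)$ and added with a fixed step size $\alpha$.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# GBRT sketch under the square loss: each tree is fit to the negative gradient
# -r_i = y_i - H(x_i) (the residuals) and added with a fixed step size alpha.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

alpha, n_rounds = 0.1, 100
trees = []
H = np.zeros_like(y)                      # current ensemble prediction H(x_i)

for _ in range(n_rounds):
    neg_grad = y - H                      # -r_i for square loss: the residuals
    h = DecisionTreeRegressor(max_depth=4).fit(X, neg_grad)
    trees.append(h)
    H += alpha * h.predict(X)             # H <- H + alpha * h

def predict(X_new):
    # ensemble prediction: H(x) = alpha * sum_t h_t(x)
    return alpha * sum(t.predict(X_new) for t in trees)

print("training MSE:", np.mean((y - H) ** 2))
```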
AdaBoost (Adaptive Boosting)
Although it is the first practical boosting algorithm, it already has some very nice properties (the training loss decreases exponentially). People did not realize that it is the same as gradient descent in the functional space when it was introduced. The step size $\alpha$ is adapted to the data, hence the name. It can only handle binary labels $y_i \in \{-1, +1\}$ and so cannot be used for regression.
The exponential loss function $\ell(H) = \sum_{i=1}^n e^{-y_i H(x_i)}$ is used. Then $\frac{\partial \ell}{\partial [H(x_i)]} = -y_i\, e^{-y_i H(x_i)}$, so

$$h_{t+1} = \operatorname*{argmin}_{h} \sum_{i=1}^n \frac{\partial \ell}{\partial [H(x_i)]}\, h(x_i) = \operatorname*{argmax}_{h} \sum_{i=1}^n w_i\, y_i\, h(x_i),$$

where $w_i = \frac{1}{Z} e^{-y_i H(x_i)}$ and $Z = \sum_{i=1}^n e^{-y_i H(x_i)}$ is just the partition function. Then,

$$h_{t+1} = \operatorname*{argmin}_{h} \sum_{i:\, h(x_i)\, y_i = -1} w_i,$$

since $h(x_i)\, y_i \in \{-1, +1\}$ and $\sum_{i=1}^n w_i = 1$; that is, the best weak learner is the one with the smallest weighted training error.
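Spelling out that last step (a one-line calculation added here for completeness, using only $h(x_i)\, y_i \in \{-1, +1\}$ and $\sum_i w_i = 1$):

$$\sum_{i=1}^n w_i\, y_i\, h(x_i) = \sum_{i:\, h(x_i) = y_i} w_i \;-\; \sum_{i:\, h(x_i) \ne y_i} w_i = 1 - 2 \sum_{i:\, h(x_i) \ne y_i} w_i,$$

so maximizing the weighted alignment is exactly minimizing the weighted training error.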
Optimal stepsize
Denote $\sum_{i:\, h(x_i)\, y_i = -1} w_i$ as $\epsilon$, the weighted training error. One can optimize the step size $\alpha$ in closed form as

$$\alpha = \operatorname*{argmin}_{\alpha} \ell(H + \alpha h) = \frac{1}{2} \ln \frac{1 - \epsilon}{\epsilon}.$$
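For completeness, the standard derivation of that closed form (filled in here, using only the quantities defined above):

$$\ell(H + \alpha h) = \sum_{i=1}^n e^{-y_i H(x_i)}\, e^{-\alpha\, y_i h(x_i)} = Z\big[(1 - \epsilon)\, e^{-\alpha} + \epsilon\, e^{\alpha}\big],$$

and setting the derivative with respect to $\alpha$ to zero gives $\epsilon\, e^{\alpha} = (1 - \epsilon)\, e^{-\alpha}$, i.e. $\alpha = \frac{1}{2} \ln \frac{1 - \epsilon}{\epsilon}$.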
Renormalization
We need to renormalize $w_i$ and $Z$ after the update $H \leftarrow H + \alpha h$. Denote the unnormalized weights as $\hat{w}_i = e^{-y_i H(x_i)}$. Then, as $H$ is updated,

$$\hat{w}_i \leftarrow \hat{w}_i\, e^{-\alpha\, y_i h(x_i)}.$$

Note that this step essentially boosts all incorrect samples ($h(x_i)\, y_i = -1$) by $e^{\alpha} = \sqrt{\frac{1 - \epsilon}{\epsilon}}$ and scales down all correct samples ($h(x_i)\, y_i = +1$) by $e^{-\alpha} = \sqrt{\frac{\epsilon}{1 - \epsilon}}$. And

$$Z \leftarrow Z \cdot 2\sqrt{\epsilon (1 - \epsilon)},$$

which is strictly smaller than $Z$ whenever $\epsilon < \frac{1}{2}$, so the training loss indeed decreases exponentially.
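Putting the pieces together, here is a minimal Python sketch of the full AdaBoost loop (decision stumps, the toy data, and the round count are illustrative choices, not from the note): each round trains a weak learner on the current weights, computes $\epsilon$ and $\alpha$, and applies exactly the multiplicative update described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# AdaBoost sketch: stumps as weak learners, exponential-loss weights w_i,
# closed-form step size alpha = 0.5*ln((1-eps)/eps), multiplicative reweighting.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)        # labels in {-1, +1}

n_rounds = 50
w = np.full(len(y), 1.0 / len(y))                 # normalized weights, sum to 1
learners, alphas = [], []

for _ in range(n_rounds):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    eps = w[pred != y].sum()                      # weighted training error
    if eps <= 0 or eps >= 0.5:                    # perfect or no-better-than-chance stump
        break
    alpha = 0.5 * np.log((1 - eps) / eps)         # optimal step size
    w *= np.exp(-alpha * y * pred)                # boost mistakes, shrink correct ones
    w /= w.sum()                                  # renormalize (divide by the new Z)
    learners.append(h)
    alphas.append(alpha)

H = sum(a * h.predict(X) for a, h in zip(alphas, learners))
print("training error:", np.mean(np.sign(H) != y))
```

The renormalization here simply divides by the new $Z$ rather than tracking the factor $2\sqrt{\epsilon(1-\epsilon)}$ explicitly; the resulting weights are identical.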
Reference
Boosting lecture note from Cornell