The goal of bagging is to avoid overfitting (high variance). Instead of training a single model, we bootstrap the dataset and train multiple models on the resampled copies. The final output is then simply the average of the individual outputs (for regression) or their majority vote (for classification). Bagging has at least two advantages (see the sketch after the list below):
- Reduce the variance of the output
- Provide a means to quantify the quality of the current estimate (based on the variance of the outputs across the models)
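
Below is a minimal sketch of bagging for regression, assuming scikit-learn decision trees as the base learners; the toy data and names such as `n_models` are illustrative.

```python
# Bagging sketch: bootstrap the data, train one tree per resample,
# then average the predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy regression data: y = sin(x) plus noise.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

n_models = 25
models = []
for _ in range(n_models):
    # Bootstrap: sample the training set with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Aggregate: average the individual predictions (regression case).
X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
all_preds = np.stack([m.predict(X_test) for m in models])  # (n_models, n_test)
y_hat = all_preds.mean(axis=0)

# The spread across models gives a rough measure of estimate quality.
y_std = all_preds.std(axis=0)
```

The standard deviation `y_std` across the bagged predictions is exactly the second advantage above: a built-in way to gauge how uncertain the ensemble is at each test point.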
Another advantage of bagging is the out-of-bag (OOB) error, an almost free estimate of the test error (no separate validation set needs to be reserved). When we draw a bootstrap sample for a sub-model, some data points are left unsampled; those points are "out-of-bag" for that sub-model and can serve as its validation data. More precisely (see the sketch after the list below):
- Iterate over every data sample in the dataset
- For the current sample, compute the prediction error using only the learners for which the sample is "out-of-bag"
- Average these per-sample errors, weighting each sample by the fraction of learners for which it is "out-of-bag"
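
A minimal sketch of the OOB error computation, again assuming scikit-learn trees and toy data; for simplicity it uses a plain average over the samples that are out-of-bag for at least one model rather than the weighted variant described above.

```python
# Out-of-bag (OOB) error sketch for a bagged regressor.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

n_samples, n_models = len(X), 25
models, oob_masks = [], []
for _ in range(n_models):
    idx = rng.integers(0, n_samples, size=n_samples)  # bootstrap sample
    in_bag = np.zeros(n_samples, dtype=bool)
    in_bag[idx] = True
    oob_masks.append(~in_bag)  # samples this model never saw during training
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# For each sample, average the predictions of the models for which the
# sample is out-of-bag, then compare against the true target.
oob_pred = np.zeros(n_samples)
oob_count = np.zeros(n_samples)
for model, oob in zip(models, oob_masks):
    oob_pred[oob] += model.predict(X[oob])
    oob_count[oob] += 1

covered = oob_count > 0  # samples that are OOB for at least one model
oob_pred[covered] /= oob_count[covered]
oob_mse = np.mean((y[covered] - oob_pred[covered]) ** 2)
print(f"OOB mean squared error: {oob_mse:.3f}")
```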
One example of bagging is the random forest, which combines bagged decision trees with random feature selection at each split.
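
In scikit-learn, the random forest implementation exposes the OOB estimate directly; a small usage sketch on a built-in dataset:

```python
# Random forest with the out-of-bag score enabled.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)
print(f"Out-of-bag accuracy: {clf.oob_score_:.3f}")
```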