# Boosting

## Summary

The underlying idea of this method is to combine simple "rules" or weak learners, each performing only slightly better than random guessing, to form an ensemble whose performance is improved, i.e. "boosted", over that of any single member. Let ${\displaystyle h_{1},h_{2},\ldots ,h_{T}}$ be a set of hypotheses, and consider the composite ensemble hypothesis

${\displaystyle f(x)=\sum _{t=1}^{T}\alpha _{t}h_{t}(x)}$

Here ${\displaystyle \alpha _{t}}$ denotes the coefficient with which the ensemble member ${\displaystyle h_{t}}$ is combined; both ${\displaystyle \alpha _{t}}$ and the hypothesis ${\displaystyle h_{t}}$ are to be learned within the boosting procedure. The coefficients ${\displaystyle \alpha _{t}}$ (or weights) are set iteratively. The intuitive idea is that examples that are misclassified get higher weights in the next iteration; for instance, examples near the decision boundary are usually harder to classify and therefore accumulate high weights after a few iterations.
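As a toy illustration of the composite hypothesis above, the sketch below combines three weak learners with fixed coefficients and takes the sign of ${\displaystyle f(x)}$ as the predicted label. Both the rules and the coefficient values are made up purely for illustration; in boosting they would be learned.

```python
# Composite ensemble hypothesis f(x) = sum_t alpha_t * h_t(x); the final
# label is the sign of f(x). The rules and alphas are illustrative only.
alphas = [0.4, 0.3, 0.3]
weak_learners = [
    lambda x: 1 if x[0] > 0 else -1,         # rough rule on feature 0
    lambda x: 1 if x[1] > 0 else -1,         # rough rule on feature 1
    lambda x: 1 if x[0] + x[1] > 0 else -1,  # rough rule on their sum
]

def ensemble_predict(x):
    f = sum(a * h(x) for a, h in zip(alphas, weak_learners))
    return 1 if f > 0 else -1

print(ensemble_predict((1.0, -2.0)))  # the rules disagree; the weighted vote gives -1
```

Even though the first rule votes +1 on this input, the combined weight of the other two rules outvotes it, which is exactly the weighted-majority behaviour of the ensemble.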

In general, boosting can be seen as a method for improving the accuracy of any given learning algorithm. It is useful when building a highly accurate prediction rule is difficult, but it is not hard to come up with very rough rules of thumb that are only moderately accurate.

Friedman et al. (2000) show that AdaBoost can be interpreted as a stagewise estimation procedure for fitting an additive logistic regression model.

In its standard form, boosting does not allow for the direct incorporation of prior human knowledge. Rochery et al. (2002) describe a modification of boosting that combines and balances human expertise with the available training data. The aim of the approach is to allow the human's rough judgments to be refined, reinforced and adjusted by the statistics of the training data, but in a manner that does not permit the data to entirely overwhelm human judgments.

The AdaBoost algorithm by Freund and Schapire is one of the most successful boosting algorithms. We focus on the problem of binary classification and present the AdaBoost algorithm for it.

A non-negative weighting ${\displaystyle \mathbf {d} ^{(t)}=(d_{1}^{(t)},\ldots ,d_{N}^{(t)})}$ is assigned to the data at step ${\displaystyle t}$, and a weak learner ${\displaystyle h_{t}}$ is constructed based on ${\displaystyle \mathbf {d} ^{(t)}}$. This weighting is updated at each iteration ${\displaystyle t}$ according to the weighted error incurred by the weak learner in the previous iteration. At each step ${\displaystyle t}$, the weak learner is required to produce a small weighted empirical error defined by

${\displaystyle \varepsilon _{t}(h_{t},\mathbf {d} ^{(t)})=\sum _{n=1}^{N}d_{n}^{(t)}\mathbf {I} (y_{n}\neq h_{t}({\boldsymbol {x}}_{n}))}$
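In code, the weighted empirical error is simply the sum of the weights of the misclassified examples — a direct transcription of the formula (using NumPy for convenience; the toy numbers are illustrative):

```python
import numpy as np

def weighted_error(d, y, preds):
    """Weighted empirical error: sum of weights d_n where h_t(x_n) != y_n.
    d: non-negative weights summing to 1; y, preds: labels in {-1, +1}."""
    return float(np.sum(d * (y != preds)))

d = np.full(4, 0.25)               # uniform weighting, N = 4
y = np.array([1, 1, -1, -1])
preds = np.array([1, -1, -1, -1])  # the weak learner errs on one example
print(weighted_error(d, y, preds)) # → 0.25
```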

After selecting the hypothesis ${\displaystyle h_{t}}$, its weight ${\displaystyle \alpha _{t}}$ is computed such that it minimizes a certain loss function. In AdaBoost one minimizes

${\displaystyle \mathrm {G} ^{AB}(\alpha )=\sum _{n=1}^{N}\mathrm {exp} \{-y_{n}(\alpha h_{t}({\boldsymbol {x}}_{n})+f_{t-1}({\boldsymbol {x}}_{n}))\}}$,

where ${\displaystyle f_{t-1}}$ is the combined hypothesis of the previous iterations, given by

${\displaystyle f_{t-1}({\boldsymbol {x}}_{n})=\sum _{r=1}^{t-1}\alpha _{r}h_{r}({\boldsymbol {x}}_{n})}$
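Minimizing ${\displaystyle \mathrm {G} ^{AB}}$ over ${\displaystyle \alpha }$ can be done in closed form; the following is the standard AdaBoost derivation, under the assumption that the weights satisfy ${\displaystyle d_{n}^{(t)}\propto \mathrm {exp} \{-y_{n}f_{t-1}({\boldsymbol {x}}_{n})\}}$ and are normalized to sum to one. Since ${\displaystyle y_{n}h_{t}({\boldsymbol {x}}_{n})\in \{-1,+1\}}$, the loss splits into the correctly and incorrectly classified examples:

${\displaystyle \mathrm {G} ^{AB}(\alpha )\propto \sum _{n=1}^{N}d_{n}^{(t)}\,e^{-\alpha y_{n}h_{t}({\boldsymbol {x}}_{n})}=e^{-\alpha }(1-\varepsilon _{t})+e^{\alpha }\varepsilon _{t}}$

Setting the derivative with respect to ${\displaystyle \alpha }$ to zero, ${\displaystyle -e^{-\alpha }(1-\varepsilon _{t})+e^{\alpha }\varepsilon _{t}=0}$, yields

${\displaystyle \alpha _{t}={\frac {1}{2}}\ln {\frac {1-\varepsilon _{t}}{\varepsilon _{t}}}}$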

For AdaBoost it has been shown that ${\displaystyle \alpha _{t}}$ can be computed analytically, leading to the expression in step (3c) of the algorithm. Based on the new combined hypothesis, the weighting ${\displaystyle {\boldsymbol {d}}}$ of the sample is updated as in step (3d) of the algorithm. The initial weighting ${\displaystyle {\boldsymbol {d}}^{(1)}}$ is chosen uniformly: ${\displaystyle d_{n}^{(1)}=1/N}$
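Putting the pieces together, here is a minimal sketch of the whole procedure. Since the algorithm box itself is not reproduced here, the sketch fills in the usual choices: one-dimensional threshold stumps as the weak learners, the closed-form weight ${\displaystyle \alpha _{t}={\tfrac {1}{2}}\ln {\tfrac {1-\varepsilon _{t}}{\varepsilon _{t}}}}$ for step (3c), and the exponential reweighting for step (3d); the toy data is likewise made up for illustration.

```python
import numpy as np

def stump_predict(stump, X):
    """Threshold stump h(x): predict s if x[j] <= th, else -s."""
    j, th, s = stump
    return np.where(X[:, j] <= th, s, -s)

def best_stump(X, y, d):
    """Exhaustive search for the stump with the smallest weighted error."""
    best, best_eps = None, np.inf
    for j in range(X.shape[1]):
        for th in np.unique(X[:, j]):
            for s in (1, -1):
                preds = np.where(X[:, j] <= th, s, -s)
                eps = float(np.sum(d * (preds != y)))
                if eps < best_eps:
                    best, best_eps = (j, th, s), eps
    return best, best_eps

def adaboost_fit(X, y, T):
    ensemble = []
    d = np.full(len(y), 1.0 / len(y))          # uniform initial weighting d^(1)
    for _ in range(T):
        stump, eps = best_stump(X, y, d)       # weak learner trained on d^(t)
        if eps <= 0 or eps >= 0.5:             # perfect, or no better than random
            break
        alpha = 0.5 * np.log((1 - eps) / eps)  # closed-form weight (cf. step 3c)
        d = d * np.exp(-alpha * y * stump_predict(stump, X))  # reweight (cf. 3d)
        d /= d.sum()                           # renormalize to a distribution
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    """Sign of the combined hypothesis f(x) = sum_t alpha_t * h_t(x)."""
    f = sum(a * stump_predict(st, X) for a, st in ensemble)
    return np.sign(f)

# Toy 1-D data that no single stump can classify correctly:
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, 1])
ens = adaboost_fit(X, y, T=3)
print((adaboost_predict(ens, X) == y).all())  # → True
```

After each round the misclassified examples carry a larger share of the weight, so the next stump concentrates on them; on this data three stumps suffice, even though each individual stump misclassifies at least two points.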

## Comment

An interesting take on boosting is contained in Friedman et al. 2000, "Additive Logistic Regression": http://www.stanford.edu/~hastie/Papers/AdditiveLogisticRegression/alr.pdf — they also discuss it in the Hastie et al. "ESL" textbook.

--Brendan 22:35, 13 October 2011 (UTC)