Note: This article is an answer written by image_doctor to the question Base classifiers for boosting on StackExchange on 23 Mar 2012 at 13:20.

There are some characteristics which may add insight to an understanding of ensemble methods.

Bagging

Probably the simplest ensemble method is bagging, which is nothing more than a collection of similar, homogeneous classifiers built on resampled training data and held together by a combination method. By averaging the outputs of its base classifiers, the ensemble ameliorates the variance caused by their instability; it leverages that instability to address the variance component of the base classifiers' error and, to a lesser degree, their bias.

You can think of bagging as providing a significant degree of smoothing to what would otherwise be a very unstable "weak" base classifier. One reason, apart from their tendency towards computational efficiency, why weak classifiers are chosen is that they exhibit higher diversity, which is a beneficial characteristic for ensembles.
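As a concrete illustration, here is a minimal sketch of that smoothing effect, assuming scikit-learn and a synthetic dataset (the answer names no particular library or data): a single unpruned decision tree is compared with a bagged ensemble of the same trees.

```python
# Minimal sketch (scikit-learn assumed): bagging smooths an unstable base
# classifier, here an unpruned decision tree, by averaging resampled copies.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)            # high-variance base classifier
bagged_trees = BaggingClassifier(DecisionTreeClassifier(),      # same unstable base classifier
                                 n_estimators=100,              # resampled copies to average over
                                 random_state=0)

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```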

If you visualise a bagged ensemble full of very strong, stable classifiers, they will have a very high degree of agreement on their classifications of examples presented to the ensemble. In effect they all vote the same way. A committee in which all members vote similarly has little utility over any single member of the committee.

So to work effectively an ensemble must embrace a degree of diversity amongst its members. Clearly a committee of members who spew forth almost random opinions is not of great utility either, so some intermediate position between these extremes is sought.

In practice, as no complete theory on the subject exists, this compromise is found using empirical methods such as cross-validation or hold-out trials, which are used to gauge a suitable strength for the base classifier.

Because this search for an optimum ensemble will normally involve adjusting parameters of both the base classifiers and the ensemble itself, it is desirable to keep the number of such parameters as small as possible; otherwise the dimensionality of the parameter search space quickly makes finding the global optimum computationally intractable. Decision trees are a popular choice because, as has been mentioned, they can be used effectively without necessarily tuning any of their parameters.
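A minimal sketch of that empirical search, again assuming scikit-learn (and its current nested-parameter naming), keeps the grid deliberately small: only the depth of the bagged trees, which controls the strength of the base classifier, is varied.

```python
# Minimal sketch (scikit-learn assumed): cross-validated search over a small
# parameter grid, here just the depth of the bagged trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

search = GridSearchCV(
    BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    param_grid={"estimator__max_depth": [1, 2, 4, 8, None]},  # base-classifier strength
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```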

Random Forests

Random forests, which are primarily bagged decision trees, leverage the significant instability of trees by injecting a strong stochastic component [ the random selection of a small number of features/factors at each decision node within a tree ] to create diversity within the ensemble. Because each node of a tree is presented with a new random selection of features, the trees are highly diverse. The ensemble then has the effect of averaging out the variance and bias of this diverse collection of trees.
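A minimal sketch of the recipe, assuming scikit-learn, where the per-node random feature selection is exposed through the max_features parameter:

```python
# Minimal sketch (scikit-learn assumed): bagged trees in which each split
# considers only a small random subset of the features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # number of bagged trees
    max_features="sqrt",  # random subset of features considered at each node
    random_state=0,
)
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```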

To be effective, a "random forest" of naive Bayes classifiers, or of any other stable base classifier such as SVMs, needs the addition of a stochastic element. For stable classifiers, relatively small variations in the training data, such as those arising from bagging, lead to very similar classifiers.

To increase diversity, other approaches could be applied, for instance varying the subset of features shown to each base classifier. This has the restriction that the available diversity is limited by the number of combinations of the feature set; once those combinations have been exhausted, there are no new classifiers available to the ensemble which would vote differently from the existing members.

For problems with relatively few features this severely limits the available pool of classifiers. It would be possible to inject further sources of randomness, say by aggressively sub-sampling the training data, but the evidence would seem to be that, in the general case, such an approach is inferior to the particular blend of bias and diversity that a random forest offers.
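Both sources of randomness can be sketched around a stable base classifier; this is a minimal illustration assuming scikit-learn, with naive Bayes members that each see only a random subset of features and an aggressively sub-sampled slice of the training data, so the members can actually disagree.

```python
# Minimal sketch (scikit-learn assumed): injecting stochastic elements around
# a stable base classifier via feature subsets and aggressive sub-sampling.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

nb_ensemble = BaggingClassifier(
    GaussianNB(),
    n_estimators=100,
    max_features=0.5,   # random feature subset per member
    max_samples=0.3,    # aggressive sub-sampling of the training data
    bootstrap=True,
    random_state=0,
)
print("single NB  :", cross_val_score(GaussianNB(), X, y, cv=5).mean())
print("NB ensemble:", cross_val_score(nb_ensemble, X, y, cv=5).mean())
```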

It is possible to successfully utilise other unstable base classifiers to inject diversity into ensemble methods, for example multi-layer perceptrons (neural networks) with few nodes and a restricted amount of training, or point-based space-filling approaches such as stochastic discrimination. Certainly in the case of MLPs a degree of parameter tuning is essential.
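For the MLP case, a minimal sketch assuming scikit-learn might look as follows; the tiny hidden layer and capped iteration count keep each member weak and diverse, and these particular values are illustrative only and would need tuning in practice.

```python
# Minimal sketch (scikit-learn assumed): bagging small, lightly trained MLPs
# as unstable base classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

small_mlp = MLPClassifier(hidden_layer_sizes=(5,), max_iter=200)  # few nodes, restricted training
mlp_ensemble = BaggingClassifier(small_mlp, n_estimators=25, random_state=0)

print("bagged small MLPs:", cross_val_score(mlp_ensemble, X, y, cv=5).mean())
```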

Boosting

Boosting takes a different approach to building the ensemble from the simple agglomerative model adopted by bagging. I suppose that, conceptually, if you think of bagging as a flat ensemble model, then boosting constructs a layered classifier.

Each round of boosting chooses a new classifier from a set of potential classifiers constructed from training data weighted, or resampled, according to the mis-classifications of the previous round. The new classifier is selected so as to minimise the total ensemble error.
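The reweighting mechanism can be sketched as follows. This is a minimal version of the standard discrete AdaBoost recipe, not code from the answer, assuming scikit-learn decision stumps as the weak learners and a synthetic dataset.

```python
# Minimal sketch of discrete AdaBoost: each round fits a stump to the current
# example weights, then up-weights the examples it got wrong.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=1000, n_features=20, random_state=0)
y = np.where(y01 == 1, 1, -1)              # labels in {-1, +1}

n_rounds = 50
w = np.full(len(y), 1.0 / len(y))          # start with uniform example weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)       # weak learner trained on weighted data
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)           # weighted training error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # vote weight of this round
    w *= np.exp(-alpha * y * pred)         # up-weight the misclassified examples
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# the ensemble votes with a weighted sum of its members
ensemble_pred = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", np.mean(ensemble_pred == y))
```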

This is in sharp contrast to the absence of any selection criteria in random forest ensemble construction. Each new base classifier is specifically required to focus on the weak points of the existing ensemble, with the result that boosting aggressively drives down training error.

In the early stages of ensemble construction boosting has few weak classifiers, each focused on a different area of the training space; the effect of this is primarily to reduce bias. As the ensemble grows the scope for bias reduction diminishes, and it is the error arising from variance that is improved instead.
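One way to watch this trajectory, sketched here with scikit-learn's AdaBoost and its staged scores on a held-out set (an assumption on my part, not something the answer prescribes):

```python
# Minimal sketch (scikit-learn assumed): how accuracy evolves as the boosted
# ensemble grows, via AdaBoost's staged scores.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

boost = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

for size, (tr, te) in enumerate(
        zip(boost.staged_score(X_tr, y_tr), boost.staged_score(X_te, y_te)), start=1):
    if size % 50 == 0:
        print(f"{size:3d} members  train acc {tr:.3f}  test acc {te:.3f}")
```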

The benefit of instability in the base classifier for boosting is that, as the ensemble grows and the number of remaining mis-classified examples falls, a higher degree of diversity is needed to generate a classifier which adopts a usefully different view of the remaining samples from that of its predecessors.

The power of this approach can be seen in the fact that acceptable results can be achieved with decision stumps alone, though MLPs have also proved very effective in general.
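A minimal sketch of how far stumps alone can go once boosted, again assuming scikit-learn and synthetic data:

```python
# Minimal sketch (scikit-learn assumed): a single depth-1 tree versus a few
# hundred boosted stumps.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)
boosted_stumps = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                    n_estimators=300, random_state=0)

print("single stump  :", cross_val_score(stump, X, y, cv=5).mean())
print("boosted stumps:", cross_val_score(boosted_stumps, X, y, cv=5).mean())
```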

Because of this constant focus on the misclassified examples, boosting's weakness is that it can be susceptible to noise; to some extent LogitBoost attempts to address this failing.

No free lunch

It is worth remembering that no grand unified theory of machine learning exists and that the results of any particular classifier are highly dependent on the type of data it is used with. So, a priori, there is no hard and fast reason to assert that one classifier type is superior to another, other than the consensus derived from previous experimentation with similar data and the general utility shown by an algorithm across a variety of data sets. To get a good solution, you may want to experiment with a handful of popular approaches.
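That kind of shoot-out is cheap to sketch; here is a minimal version assuming scikit-learn, with your own data substituted for the synthetic set.

```python
# Minimal sketch (scikit-learn assumed): cross-validate a handful of popular
# approaches before committing to one.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes":   GaussianNB(),
    "SVM":           SVC(),
    "bagged trees":  BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "AdaBoost":      AdaBoostClassifier(n_estimators=200, random_state=0),
}

for name, clf in candidates.items():
    print(f"{name:14s} {cross_val_score(clf, X, y, cv=5).mean():.3f}")
```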