Ensemble models in machine learning combine the outputs of multiple models to build a model that performs better than any of the individual models. The individual models whose outputs are combined in this way are called base models or base estimators.
The objective behind ensemble learning is to mitigate the different types of errors associated with individual models, using the collective intelligence of multiple models to build a model with better performance.
Ensemble algorithms try to minimize errors without compromising the model's ability to generalize.
Ensemble modeling can be thought of as bringing together the expertise of different people to solve a particular problem.
This approach builds a strong model from many weak base estimators.
Ensemble methods
The four methods usually followed to build ensemble models are as follows:
- Bagging (Bootstrapped Aggregation)
- Boosting
- Stacking
- Cascading
Bagging (Bootstrapped Aggregation)
The method of bagging is based on the idea of bootstrapping followed by aggregation. Bootstrapping creates multiple subsets of the original dataset by sampling with replacement. Each subset of training data is of equal size, and a base model is trained on each subset in parallel.
At run time, a query point is passed to all the trained base models and their outputs are collected.
Aggregation combines the outputs of the base models. For a classification task, aggregation follows the majority vote rule; for regression problems, the mean or median of the predictions is used.
In the bagging method, the aggregation step reduces variance without any impact on bias. Hence bagging chooses base learners that are low bias, high variance models and aggregates their outputs to obtain a final model with low bias and reduced variance.
Random forest is an example of bagging, where multiple decision trees are built on bootstrapped subsets of the training data and combined through aggregation. Decision trees of reasonable depth are low bias, high variance models, and aggregation gives a final output with low bias and reduced variance.
Random Forest: Decision trees (Base learner) + sampling with replacement of training data + Aggregation
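As a concrete illustration, here is a minimal sketch of bagging and random forest with scikit-learn. The synthetic dataset, the hyperparameter values, and the `estimator` parameter name (used in scikit-learn 1.2 and later; older releases call it `base_estimator`) are assumptions made for the example rather than part of the method itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Plain bagging: decision trees trained on bootstrap samples, combined by majority vote.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,   # sample the training data with replacement
    random_state=42,
)
bagging.fit(X_train, y_train)

# Random forest: the same idea, plus a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

print("Bagging accuracy:      ", bagging.score(X_test, y_test))
print("Random forest accuracy:", forest.score(X_test, y_test))
```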
Boosting
Boosting works in a sequential manner: the output of one model feeds into the next, and each model tries to correct the errors made by the previous one. In contrast to bagging, the boosting method is used to handle low variance, high bias situations, and it combines the models additively to reduce bias.
Adaptive boosting (AdaBoost) and gradient boosting machines (GBM) are two widely used ensemble models based on boosting.
AdaBoost adapts to the errors made in the previous stage and corrects them by assigning more weight to the misclassified points. Adaptive boosting is used in computer vision for face detection applications.
Gradient boosting methods are popular for their proven record in predictive capability and are therefore often used in Kaggle competitions. GBM can be used for both classification and regression tasks.
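The sketch below shows AdaBoost and gradient boosting with scikit-learn on the same kind of synthetic data; the dataset and hyperparameter values are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost: each stage re-weights the points misclassified by the previous stage.
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)

# Gradient boosting: each stage fits the errors left by the current additive model.
gbm = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
gbm.fit(X_train, y_train)

print("AdaBoost accuracy:         ", ada.score(X_test, y_test))
print("Gradient boosting accuracy:", gbm.score(X_test, y_test))
```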
Stacking
In this method, different individual models, for instance support vector machines or k-nearest neighbors, are built using the training data set. The features extracted from the outputs of these individual base learners are called meta features.
In the stacking method, the training data is split into several equal parts (say 12 parts). A base learner, for instance a support vector machine, is trained on 11 of these parts with any classification or regression algorithm, and the remaining part is held out for making predictions.
In the next iteration, a different part is held out for predictions and another classification or regression model, such as k-nearest neighbors or a decision tree, is fit on the remaining parts. This strategy is repeated until the entire data set has been covered. The predictions collected across the iterations become the meta features used to build a new, second-level model.
Stacking is less preferred for real-world applications because of the high prediction latency it introduces.
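A minimal sketch of stacking with scikit-learn's StackingClassifier follows. The class handles the fold rotation internally: the base learners' out-of-fold predictions become the meta features on which the final estimator is trained. The choice of base learners, the meta-model, and the number of folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# Base learners whose out-of-fold predictions become the meta features.
base_learners = [
    ("svm", SVC(probability=True)),
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(max_depth=5)),
]

stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),  # meta-model trained on the meta features
    cv=5,                                  # folds used to generate out-of-fold predictions
)
stack.fit(X_train, y_train)
print("Stacking accuracy:", stack.score(X_test, y_test))
```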
Cascading
Cascading is an ensemble method usually preferred when the cost of making a mistake is high. The method passes the information collected from each classifier as additional input to the next classifier in the chain. To make sure that uncertain predictions are still resolved correctly, in most cases a human being sits at the end of the chain of models. Cascading is used in fraud detection applications.
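Below is a minimal sketch of the confidence-gating part of a cascade, assuming a binary fraud-style label and two scikit-learn classifiers that expose predict_proba. The threshold value, the model choices, and the final "escalate to human review" step are illustrative assumptions; a full cascade could also concatenate the outputs of one stage onto the features fed to the next.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for a fraud detection problem.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

stage1 = LogisticRegression(max_iter=1000).fit(X_train, y_train)
stage2 = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_train, y_train)

THRESHOLD = 0.99  # accept a "not fraud" call only when the model is very sure

proba1 = stage1.predict_proba(X_test)[:, 0]   # P(class 0) from the cheap first stage
settled = proba1 >= THRESHOLD                 # resolved by stage 1
unsettled = ~settled

proba2 = stage2.predict_proba(X_test[unsettled])[:, 0]
escalated = proba2 < THRESHOLD                # stage 2 is also unsure: send to a human

print("Resolved by stage 1:       ", settled.sum())
print("Resolved by stage 2:       ", (proba2 >= THRESHOLD).sum())
print("Escalated to human review: ", escalated.sum())
```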
Methods to aggregate predictions
We use aggregation in ensemble learning to combine the predictions of different models. The most common aggregation rules are listed below, followed by a short sketch of each.
- Averaging: In this method we take the average of the predictions from each model to make the final prediction. Averaging is usually employed in regression tasks. In some cases we assign weights to the predictions of certain models based on their performance metric; this is called weighted averaging.
- Mean/Median: This is another way to perform aggregation for regression problems. The median of the individual predictions can be preferred over the mean when some models produce outlier predictions, since the median is more robust to them.
- Majority rule: This is usually followed in classification problems, where the final prediction is the class predicted by the majority of the models.
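A short NumPy sketch of these aggregation rules, using made-up prediction values and weights purely for illustration:

```python
import numpy as np

# Regression: three models predict a value for the same query point.
reg_preds = np.array([10.2, 11.0, 35.0])            # the third model is an outlier
print("Mean:            ", reg_preds.mean())        # pulled up by the outlier
print("Median:          ", np.median(reg_preds))    # robust to the outlier
weights = np.array([0.5, 0.3, 0.2])                 # weights based on model performance
print("Weighted average:", np.average(reg_preds, weights=weights))

# Classification: three models predict a class label for the same query point.
clf_preds = np.array([1, 0, 1])
values, counts = np.unique(clf_preds, return_counts=True)
print("Majority vote:   ", values[np.argmax(counts)])
```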
Ensemble models have excellent potential to deliver good performance because they open up scope to combine different classification and regression techniques in an effective way. Though ensemble models are good for generalization, careful training is required to keep training errors low.
Despite the advantages, the time required to deploy ensemble models into production is high. Ensemble models are a good option when we want a model with low bias, low variance and reduced noise, and when the speed or throughput of the prediction task is not a major concern.
Also, the interpretability of the final model produced by an ensemble method is low, as multiple algorithms are involved in the process.