Ensemble models in machine learning combine the outputs of multiple models to build a model that performs better than the individual models. The individual models whose outputs are combined in this way are called base models or base estimators.
The objective of ensemble learning is to alleviate the various types of errors associated with individual models and to use the collective intelligence of multiple models to build a model with a better performance metric.
Ensemble algorithms try to minimize errors without compromising the model's ability to generalize.
Ensemble modeling can be considered as bringing together the expertise of different people to solve a particular problem.
This method builds strong models from many weak base estimators.
Ensemble methods
The four methods usually followed to build ensemble models are:
- Bagging (Bootstrapped Aggregation)
- Boosting
- Stacking
- Cascading
Bagging (Bootstrapped Aggregation)
The method of bagging is based on the idea of bootstrapping followed by aggregation. Bootstrapping involves creating multiple subsets of data by sampling from the original dataset with replacement. Each subset of training data is of equal size, and the base models are trained on these subsets in parallel.
At run time, a query point is passed to all the base models and their outputs are collected.
Aggregation combines the outputs of the base models. For a classification task, aggregation is based on the majority-vote rule; for regression problems, the mean or median is used.
In the bagging method, the aggregation step reduces variance without increasing bias. Hence bagging starts from base learners that are low-bias, high-variance models and aggregates their outputs to obtain a final model with low bias and reduced variance.
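To make the bootstrap-and-aggregate idea concrete, here is a minimal, illustrative sketch (not a library implementation) that trains several decision trees on bootstrap samples of a NumPy training set and averages their predictions for a regression task; the function name bagging_predict and the parameter n_models are only illustrative:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
def bagging_predict(X_train, y_train, X_test, n_models=10, random_state=42):
    # X_train and y_train are assumed to be NumPy arrays
    rng = np.random.default_rng(random_state)
    n_samples = len(X_train)
    predictions = []
    for _ in range(n_models):
        # Bootstrapping: sample the training data with replacement
        idx = rng.integers(0, n_samples, size=n_samples)
        tree = DecisionTreeRegressor()
        tree.fit(X_train[idx], y_train[idx])
        predictions.append(tree.predict(X_test))
    # Aggregation: average the base learners' predictions (regression)
    return np.mean(predictions, axis=0)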
Random forest is an example of bagging where multiple decision trees are built on subsets of the training data and combined using aggregation. Decision trees of reasonable depth are low-bias, high-variance models, and aggregation gives a final model with low bias and reduced variance.
Random Forest = decision trees (base learners) + sampling of the training data with replacement + aggregation
Example Case: Predicting Housing Prices
Imagine you’re tasked with predicting housing prices based on various features such as location, size, number of bedrooms, and so on. You decide to use a bagging ensemble technique, specifically Random Forest, to tackle this regression problem.
Python Code Example:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Load the California housing dataset
# (load_boston was removed in scikit-learn 1.2, so we use this dataset instead)
data = fetch_california_housing()
X, y = data.data, data.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
# Train the Random Forest model
rf_regressor.fit(X_train, y_train)
# Predict on the test set
y_pred = rf_regressor.predict(X_test)
# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
Explanation:
In this example, we use the California housing dataset to predict housing prices. We split the data into training and testing sets. Then, we initialize a Random Forest Regressor with 100 decision trees. Each decision tree is trained on a bootstrap sample of the training data (sampling with replacement).
During the prediction phase, each decision tree predicts the housing price for a given input. The final prediction is obtained by aggregating the predictions from all decision trees. For regression tasks like this one, aggregation typically involves taking the mean of all predictions.
By combining the predictions of multiple decision trees trained on different subsets of data, Random Forest reduces overfitting and provides more accurate and robust predictions compared to individual decision trees.
In summary, bagging techniques like Random Forest leverage the power of ensemble learning to enhance predictive performance by reducing variance without significantly increasing bias, making them valuable tools in machine learning.
Access the dataset
You can use the California housing dataset directly from scikit-learn, a popular machine learning library in Python (the Boston housing dataset used in older tutorials was removed in scikit-learn 1.2). Here's how you can access it:
from sklearn.datasets import fetch_california_housing
# Load the California housing dataset
data = fetch_california_housing()
X, y = data.data, data.target
This code loads the California housing dataset: X contains the features and y contains the target variable, i.e., median house values. You don't need to download it from an external link, as it is included in the scikit-learn library.
The dataset documentation is available in the scikit-learn documentation for fetch_california_housing.
Boosting
Boosting works in a sequential manner: each model in the sequence is built using information about the errors of the previous model and tries to correct them. In contrast to bagging, boosting is used to handle low-variance, high-bias situations. Boosting reduces bias by combining the models additively.
Adaptive Boosting (AdaBoost) and Gradient Boosting Machines (GBM) are two widely used ensemble models based on boosting.
AdaBoost adapts to the errors made in the previous stage and rectifies them by assigning more weight to the misclassified points. Adaptive boosting has been used in computer vision for face detection applications.
Gradient boosting methods are popular for their proven predictive capabilities and are often used in Kaggle competitions. GBMs can be used for both classification and regression tasks.
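To illustrate the additive, sequential idea behind gradient boosting, here is a minimal sketch (not the library implementation) that repeatedly fits a shallow tree to the residuals of the ensemble built so far for a regression target; the function name and the parameters n_rounds and learning_rate are only illustrative:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
def simple_gradient_boosting(X_train, y_train, X_test, n_rounds=100, learning_rate=0.1):
    # Start from a constant prediction (the mean of the training target)
    train_pred = np.full(len(y_train), np.mean(y_train))
    test_pred = np.full(len(X_test), np.mean(y_train))
    for _ in range(n_rounds):
        # Fit a shallow tree to the current residuals (errors of the ensemble so far)
        residuals = y_train - train_pred
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X_train, residuals)
        # Add a damped version of the new tree's predictions to the ensemble
        train_pred = train_pred + learning_rate * tree.predict(X_train)
        test_pred = test_pred + learning_rate * tree.predict(X_test)
    return test_pred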
Example Case: Predicting Diabetes
Consider a scenario where you aim to predict whether a patient's diabetes progression is high or low based on various health indicators. You opt to use AdaBoost and Gradient Boosting (GBM) to tackle this classification problem.
Python Code Example:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
# Load the diabetes dataset
data = load_diabetes()
X, y = data.data, data.target
# The target is a continuous measure of disease progression, so we binarize it
# at the median to obtain a binary classification task for this illustration
y = (y > np.median(y)).astype(int)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize AdaBoost classifier
adaboost_clf = AdaBoostClassifier(n_estimators=50, random_state=42)
# Train the AdaBoost model
adaboost_clf.fit(X_train, y_train)
# Predict on the test set
y_pred_adaboost = adaboost_clf.predict(X_test)
# Calculate accuracy
accuracy_adaboost = accuracy_score(y_test, y_pred_adaboost)
print("AdaBoost Classifier Accuracy:", accuracy_adaboost)
# Initialize Gradient Boosting classifier
gbm_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
# Train the GBM model
gbm_clf.fit(X_train, y_train)
# Predict on the test set
y_pred_gbm = gbm_clf.predict(X_test)
# Calculate accuracy
accuracy_gbm = accuracy_score(y_test, y_pred_gbm)
print("Gradient Boosting Classifier Accuracy:", accuracy_gbm)
Explanation:
In this example, we use scikit-learn's diabetes dataset. Because its target is a continuous measure of disease progression, we binarize it at the median so that the task becomes a binary classification problem. We then split the data into training and testing sets and initialize both AdaBoost and Gradient Boosting classifiers.
AdaBoost adapts to errors from the previous stage by assigning more weight to misclassified points. This iterative process focuses on difficult-to-classify instances, improving overall performance.
On the other hand, Gradient Boosting builds trees sequentially, with each new tree fitted to the residual errors of the ensemble built so far, effectively reducing the error in subsequent iterations. This method excels in predictive capability and is commonly used in competitions like Kaggle.
By employing AdaBoost and Gradient Boosting, we can effectively address the bias of individual weak learners and achieve accurate predictions in our diabetes classification task.
Access the dataset
You can use the diabetes dataset directly from scikit-learn, a popular machine learning library in Python. Here’s how you can access it:
from sklearn.datasets import load_diabetes
# Load the diabetes dataset
data = load_diabetes()
X, y = data.data, data.target
This code loads the diabetes dataset: X contains the features and y contains the target variable, a quantitative measure of disease progression one year after baseline (which we binarize for the classification example above). You don't need to download it from an external link, as it is included in the scikit-learn library.
The dataset documentation is available in the scikit-learn documentation for load_diabetes.
Stacking
In this method, several individual models, for example support vector machines and k-nearest neighbors, are trained and their predictions are used as new features, called meta features, for a higher-level model.
To generate the meta features, the training data is split into several equal parts (say 12 folds). A base learner, for instance a support vector machine, is trained on 11 of the folds and makes predictions on the held-out fold. This is repeated, holding out a different fold each time, until out-of-fold predictions are available for the entire training set, and the same procedure is applied to the other base learners such as k-nearest neighbors or decision trees. The out-of-fold predictions collected in this way are the meta features used to train a new model, the meta model.
Stacking is less preferred in real-world applications because of its high prediction latency.
Example Case: Predicting Credit Risk
Imagine you want to predict whether a credit applicant is a good or bad credit risk. We assume the credit data has already been loaded and split into X_train, X_test, y_train, and y_test (see "Access the dataset" below).
Python Code Example:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score
import numpy as np
# Initialize base classifiers
rf_clf = RandomForestClassifier()
knn_clf = KNeighborsClassifier()
svm_clf = SVC(probability=True)  # probability=True is required for predict_proba
lr_clf = LogisticRegression()
# Generate meta features using cross-validation
meta_features = []
for clf in [rf_clf, knn_clf, svm_clf, lr_clf]:
    meta_features.append(cross_val_predict(clf, X_train, y_train, cv=5, method='predict_proba'))
# Stack the meta features
X_meta = np.concatenate(meta_features, axis=1)
# Fit the meta model
meta_model = LogisticRegression()
meta_model.fit(X_meta, y_train)
# Generate meta features for the test set
meta_features_test = []
for clf in [rf_clf, knn_clf, svm_clf, lr_clf]:
    clf.fit(X_train, y_train)
    meta_features_test.append(clf.predict_proba(X_test))
X_meta_test = np.concatenate(meta_features_test, axis=1)
# Make predictions using the meta model
stacking_predictions = meta_model.predict(X_meta_test)
# Calculate accuracy
accuracy_stacking = accuracy_score(y_test, stacking_predictions)
print("Stacking Classifier Accuracy:", accuracy_stacking)
Explanation:
In this example, we aim to predict credit risk using stacking, a method that combines the predictions of multiple base classifiers. We initialize base classifiers such as Random Forest, K-Nearest Neighbors, Support Vector Machines, and Logistic Regression.
Using cross-validation, we generate meta features by obtaining predictions from each base classifier on different subsets of the training data. These meta features are then concatenated to form the input for a meta model, which learns to combine the predictions of the base classifiers.
During testing, we generate meta features for the test set using the trained base classifiers, and then use the meta model to make final predictions.
Stacking allows us to leverage the strengths of different classifiers and potentially improve predictive performance by learning to effectively combine their outputs.
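For comparison, scikit-learn also provides a built-in StackingClassifier that performs this cross-validated meta-feature construction internally. A minimal sketch, assuming the same X_train, y_train, X_test, and y_test as above, could look like this:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
# Base learners plus a logistic regression meta model
stacking_clf = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier()),
        ('knn', KNeighborsClassifier()),
        ('svm', SVC(probability=True)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold predictions are generated with 5-fold cross-validation
)
stacking_clf.fit(X_train, y_train)
print("Built-in StackingClassifier Accuracy:", stacking_clf.score(X_test, y_test))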
Access the dataset
You can use a credit risk dataset from the UCI Machine Learning Repository, such as the Statlog (German Credit Data) dataset, which is commonly used for predicting credit risk.
This dataset contains various features such as credit amount, duration, employment status, and more, which can be used to predict whether a credit applicant is a good or bad credit risk.
Once you download the dataset, you can load it into your Python environment using libraries like pandas or scikit-learn for further processing and modeling.
Cascading
Cascading is an ensemble method usually preferred when the cost of making a mistake is high. The method concatenates the information collected from each classifier and passes it as additional information to the next classifier. To ensure highly reliable predictions, in most cases a human being sits at the end of the chain of models. Cascading is commonly used in fraud detection applications.
Example Case: Fraud Detection in Banking
Consider a banking scenario where detecting fraudulent transactions is crucial. Here, we implement cascading to sequentially evaluate transactions and identify potential fraud.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
# Assumes X_train, X_test, y_train, y_test have already been created
# from a transaction dataset with a binary fraud label (see "Access the dataset" below)
# Initialize base classifiers
rf_clf = RandomForestClassifier()
dt_clf = DecisionTreeClassifier()
lr_clf = LogisticRegression()
# Train the base classifiers
rf_clf.fit(X_train, y_train)
dt_clf.fit(X_train, y_train)
lr_clf.fit(X_train, y_train)
# Get predictions from base classifiers
rf_pred = rf_clf.predict(X_test)
dt_pred = dt_clf.predict(X_test)
lr_pred = lr_clf.predict(X_test)
# Concatenate predictions to form additional information
additional_info = np.column_stack((rf_pred, dt_pred, lr_pred))
# Human evaluation step (simulated here by a simple majority-vote rule)
human_evaluated_predictions = []
for info in additional_info:
    # In practice a human analyst would review the combined evidence;
    # here we approximate that review with a majority vote
    if info.sum() >= 2:  # at least 2 of the 3 classifiers predict fraud
        human_evaluated_predictions.append(1)  # flag as fraud
    else:
        human_evaluated_predictions.append(0)  # treat as non-fraud
# Calculate accuracy
accuracy_cascading = accuracy_score(y_test, human_evaluated_predictions)
print("Cascading Classifier Accuracy:", accuracy_cascading)
Explanation:
In this example, we apply cascading to detect fraudulent transactions in banking. We start by training base classifiers including Random Forest, Decision Tree, and Logistic Regression models. Each classifier independently predicts whether a transaction is fraudulent.
Next, we concatenate the predictions from these base classifiers to form additional information. This information is then passed to a human evaluator (simulated in the code above by a simple majority-vote rule) who performs additional checks or evaluations.
Based on the combined predictions and the human evaluation, we make a final prediction on whether a transaction is fraudulent or not.
Cascading allows us to leverage multiple classifiers while incorporating human judgment to ensure accurate prediction probabilities, making it particularly suitable for applications like fraud detection where the cost of errors is high.
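Note that the example above combines the three classifiers in parallel through a majority vote. A more literal cascade escalates only the cases one stage is unsure about to the next, more expensive stage (and ultimately to a human analyst). A minimal sketch of such a probability-threshold cascade, assuming the same train/test split as above with NumPy arrays and illustrative thresholds, might look like this:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Stage 1: a cheap model screens all transactions
stage1 = LogisticRegression().fit(X_train, y_train)
# Stage 2: a stronger model reviews only the uncertain cases
stage2 = RandomForestClassifier().fit(X_train, y_train)
p1 = stage1.predict_proba(X_test)[:, 1]
final_pred = np.zeros(len(X_test), dtype=int)
# Confident decisions are made at stage 1 (thresholds are illustrative)
final_pred[p1 >= 0.9] = 1  # very likely fraud
final_pred[p1 <= 0.1] = 0  # very likely legitimate
uncertain = (p1 > 0.1) & (p1 < 0.9)
# Uncertain cases are escalated to stage 2; anything stage 2 still flags
# would go to a human analyst in a real deployment
if uncertain.any():
    p2 = stage2.predict_proba(X_test[uncertain])[:, 1]
    final_pred[uncertain] = (p2 >= 0.5).astype(int)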
Access the dataset
For a fraud detection scenario in banking, you might not find a specific dataset that exactly fits the example described above. However, you can simulate a dataset or use a dataset with similar characteristics from various sources. Here are a few options:
- Kaggle: Kaggle hosts a variety of datasets related to finance and fraud detection. You can search for datasets related to credit card fraud or financial transactions.
- UCI Machine Learning Repository: This repository hosts several datasets related to fraud detection and financial transactions. Although there might not be an exact match for the described scenario, you can find datasets that are relevant for experimentation.
- Synthetic Data Generation: If you're unable to find a suitable dataset, you can generate synthetic data that mimics the characteristics of fraudulent transactions. Libraries like faker in Python can help in generating synthetic data; a minimal sketch using scikit-learn is shown after this list.
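For example, here is a minimal, illustrative sketch that simulates an imbalanced, fraud-like dataset with scikit-learn's make_classification and produces the X_train, X_test, y_train, and y_test variables assumed by the cascading code above; the feature values are synthetic and do not correspond to real transactions:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Simulate 10,000 transactions with 10 features; roughly 2% are labeled as fraud
X, y = make_classification(n_samples=10000, n_features=10, n_informative=6,
                           weights=[0.98, 0.02], random_state=42)
# Split into the training and testing sets used by the cascading example
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)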
Methods to aggregate predictions
We use aggregation in ensemble learning to combine the predictions of different models. The most common aggregation methods are listed below, followed by a short sketch that illustrates them.
- Averaging: We take the average (mean) of the predictions from each model to make the final prediction. Averaging is usually employed in regression tasks. In some cases, predictions from certain models are given more weight based on their performance metric; this is called weighted averaging.
- Median: This is another aggregation method for regression problems. Instead of the mean of the individual predictions, the median is used, which is more robust to outlying predictions.
- Majority rule: This is usually followed in classification problems, where the final prediction is the class that receives the majority of the votes.
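A minimal sketch of these aggregation rules using NumPy, with small made-up prediction arrays purely for illustration, could look like this:
import numpy as np
# Regression: predictions from 3 models for 4 query points
reg_preds = np.array([[10.0, 12.0, 11.0, 9.0],
                      [11.0, 13.0, 10.0, 8.5],
                      [12.0, 11.5, 10.5, 9.5]])
mean_pred = reg_preds.mean(axis=0)                  # simple averaging
weighted_pred = np.average(reg_preds, axis=0,
                           weights=[0.5, 0.3, 0.2])  # weighted averaging
median_pred = np.median(reg_preds, axis=0)          # median (robust to outliers)
# Classification: class labels from 3 models for 4 query points
clf_preds = np.array([[1, 0, 1, 0],
                      [1, 1, 1, 0],
                      [0, 0, 1, 0]])
majority_vote = (clf_preds.sum(axis=0) >= 2).astype(int)  # majority rule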
Ensemble models have excellent potential to deliver strong performance metrics, as they make it possible to combine different classification and regression techniques in an effective way. Although ensemble models generalize well, careful training is required to keep training errors low.
Despite these advantages, the time required to deploy ensemble models into production is high. Ensemble models are a good option when we want a model with low bias, low variance, and reduced noise, and when the prediction speed or throughput of the task is not a major concern.
Also, the interpretability of the final ensemble model is low, as multiple algorithms are involved in the process.
End Note:
Thank you for exploring the world of ensemble models in machine learning with us! We hope this article has provided valuable insights into the diverse techniques and applications of ensemble learning.
Your feedback is invaluable to us. If you found this article helpful or have any suggestions for improvement, we’d love to hear from you. Please feel free to leave your comments and thoughts below.
If you’re aspiring to pursue data science job roles, we’ve compiled a comprehensive folder of interview questions to support your preparation.
For more articles on machine learning techniques, applications, and insights, do subscribe to our blog.
Keep learning, exploring, and innovating with machine learning!