Ensemble Learning in Python
Ensemble Learning is a technique that combines multiple machine learning models to improve overall performance and accuracy, and it can be used for both classification and regression tasks. The core idea is to build a strong model by aggregating the predictions of several weaker base learners. Popular ensemble methods include Bagging, Boosting, and Stacking.
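To make the basic idea concrete before diving into the individual methods, here is a minimal sketch that combines three different classifiers by majority vote using Scikit-learn's VotingClassifier. The choice of models and parameters here is illustrative, not tuned:

# A minimal voting-ensemble sketch: three different models vote on each prediction
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hard voting: the ensemble predicts the class chosen by most of its members
ensemble = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('knn', KNeighborsClassifier()),
], voting='hard')
ensemble.fit(X_train, y_train)
print(f'Voting accuracy: {ensemble.score(X_test, y_test):.2f}')

Even this simple combination often matches or beats its best individual member, which is the motivation behind all the methods below.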
Bagging
Bagging, short for Bootstrap Aggregating, is an ensemble method that trains multiple instances of the same model on different bootstrap samples of the training data, i.e., random subsets drawn with replacement. The final prediction is obtained by aggregating the individual predictions, typically by majority voting (for classification) or averaging (for regression).
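Before turning to Random Forest, here is a minimal sketch of plain bagging with Scikit-learn's BaggingClassifier, which handles the bootstrap sampling and vote aggregation for us. The parameter values are illustrative (note that the keyword is estimator in scikit-learn >= 1.2; older versions call it base_estimator):

# Bagging a decision tree: each of the 50 trees sees a different bootstrap sample
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # the model to replicate (base_estimator in scikit-learn < 1.2)
    n_estimators=50,                     # number of bootstrap-trained copies
    bootstrap=True,                      # sample the training data with replacement
    random_state=42,
)
bagging.fit(X_train, y_train)
print(f'Bagging accuracy: {bagging.score(X_test, y_test):.2f}')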
The Random Forest algorithm is a well-known bagging technique built on a collection of decision trees. Random Forest trains each tree on a bootstrap sample of the data, considers only a random subset of the features at each split, and combines the trees' predictions to make the final decision.
Random Forest
Let’s see how to implement Random Forest using Scikit-learn in Python:
# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest classifier with 100 trees
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
In the code above, we load the Iris dataset from Scikit-learn and split it into training and testing sets. We then create a Random Forest Classifier with 100 decision trees and fit it to the training data. Finally, we make predictions on the test set and calculate the accuracy of the model.
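As a quick follow-up, a fitted Random Forest also exposes per-feature importances, which can help interpret the ensemble. Continuing with the rf_classifier fitted above:

# Inspect which features the forest relied on most (continues the example above)
for name, importance in zip(iris.feature_names, rf_classifier.feature_importances_):
    print(f'{name}: {importance:.3f}')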
Boosting
Boosting is another ensemble method that improves the performance of weak learners by training them sequentially. Unlike bagging, where the models are trained independently of one another, boosting trains each new model to correct the errors of its predecessors.
One of the most popular boosting algorithms is AdaBoost (Adaptive Boosting). After each round, AdaBoost increases the weights of misclassified data points so that subsequent models pay more attention to the instances that are hardest to classify.
AdaBoost
Let’s see how to implement AdaBoost using Scikit-learn in Python:
# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Base estimator: a decision stump (tree of depth 1)
base_estimator = DecisionTreeClassifier(max_depth=1)

# AdaBoost classifier with 50 weak learners
# (the keyword is `estimator` in scikit-learn >= 1.2; older versions use `base_estimator`)
adaboost_classifier = AdaBoostClassifier(estimator=base_estimator, n_estimators=50, random_state=42)
adaboost_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = adaboost_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
In the code above, we load the Iris dataset from Scikit-learn and split it into training and testing sets. We define a weak learner as a Decision Tree with a maximum depth of 1. We then create an AdaBoost Classifier with 50 weak learners and fit it to the training data. Finally, we make predictions on the test set and calculate the accuracy of the model.
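To see the sequential nature of boosting in action, AdaBoostClassifier exposes staged_predict, which yields the ensemble's predictions after each boosting round. The following snippet, continuing the example above, tracks how test accuracy evolves as weak learners are added:

# Accuracy after each boosting round (continues the example above)
for i, stage_pred in enumerate(adaboost_classifier.staged_predict(X_test), start=1):
    if i % 10 == 0:  # print every 10th round to keep the output short
        print(f'Round {i:2d}: accuracy = {accuracy_score(y_test, stage_pred):.2f}')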
Stacking
Stacking, also known as Stacked Generalization, is a more advanced ensemble method that involves combining multiple base models and using a meta-model to make the final prediction. The base models are trained on the original features, and their predictions are then used as input features for the meta-model. It leverages the strengths of different base models, which can lead to improved predictive performance compared to individual models.
Stacking is available out of the box in Scikit-learn (as sklearn.ensemble.StackingClassifier, since version 0.22) and in third-party libraries such as mlxtend. Let's see how to implement Stacking using mlxtend in Python:
# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Base models
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    LogisticRegression(random_state=42),
]

# Meta-model that learns from the base models' predictions
meta_model = LogisticRegression()

# Stacking classifier
stacking_classifier = StackingClassifier(classifiers=base_models, meta_classifier=meta_model)
stacking_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = stacking_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
In the code above, we load the Iris dataset from Scikit-learn and split it into training and testing sets. We define two base models (Random Forest and Logistic Regression) and a meta-model (Logistic Regression). We then create a Stacking Classifier with the base models and the meta-model and fit it to the training data. Finally, we make predictions on the test set and calculate the accuracy of the model.
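For comparison, here is roughly the same ensemble using Scikit-learn's built-in StackingClassifier, which additionally uses internal cross-validation to generate the base models' predictions for the meta-model. The import is aliased to avoid clashing with mlxtend's class of the same name:

# The same stack with scikit-learn's built-in StackingClassifier (continues the example above)
from sklearn.ensemble import StackingClassifier as SklearnStackingClassifier

sk_stacking = SklearnStackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('lr', LogisticRegression(random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # base-model predictions for the meta-model come from 5-fold cross-validation
)
sk_stacking.fit(X_train, y_train)
print(f'Stacking (scikit-learn) accuracy: {sk_stacking.score(X_test, y_test):.2f}')

The cross-validated variant is generally the safer default, since fitting the meta-model on predictions made over the training data itself can lead to overfitting.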