Machine Learning: A Practical Guide for Aspiring Data Scientists

Ensemble Learning in Python

Ensemble Learning is a technique that combines multiple machine learning models to improve overall performance and accuracy. It can be used for both classification and regression tasks. The core idea is to build a strong model by aggregating the predictions of several weaker base learners. Popular ensemble methods include Bagging, Boosting, and Stacking.
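
To make the idea concrete before diving into the individual methods, here is a minimal sketch that combines three different classifiers by majority vote using Scikit-learn's VotingClassifier. The specific model choices here are illustrative, not prescriptive:

# Minimal sketch: combining three classifiers by majority vote
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# voting='hard' (the default) takes a majority vote over predicted class labels
voting_classifier = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('knn', KNeighborsClassifier()),
    ('dt', DecisionTreeClassifier(random_state=42)),
])
voting_classifier.fit(X, y)
print(voting_classifier.predict(X[:3]))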

Bagging

Bagging, short for Bootstrap Aggregating, is an ensemble method that involves training multiple instances of the same model on different subsets of the training data. The final prediction is then obtained by aggregating the predictions of each model, typically by voting (for classification) or averaging (for regression).
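Bagging is also available directly in Scikit-learn. Here is a minimal sketch with illustrative parameter choices (note: the parameter is named estimator from Scikit-learn 1.2 onward; older versions call it base_estimator):

# Minimal sketch: bagging decision trees trained on bootstrap samples
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
bagging_classifier = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # the model to replicate
    n_estimators=10,   # ten trees, each trained on its own bootstrap sample
    bootstrap=True,    # sample the training data with replacement
    random_state=42,
)
bagging_classifier.fit(X, y)
print(bagging_classifier.predict(X[:3]))  # aggregated by majority vote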

The Random Forest algorithm is an example of a bagging technique that uses a collection of decision trees. Random Forest builds each tree on a bootstrap sample of the training data, considers a random subset of the features at each split, and combines the trees' predictions to make a final decision.

Random Forest

Let’s see how to implement Random Forest using Scikit-learn in Python:

# Importing the required libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

In the code above, we load the Iris dataset from Scikit-learn and split it into training and testing sets. We then create a Random Forest Classifier with 100 decision trees and fit it to the training data. Finally, we make predictions on the test set and calculate the accuracy of the model.
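
A fitted Random Forest also exposes feature_importances_, which is often a useful next step for interpreting the model. A short sketch, continuing from the code above:

# Inspect which features the forest relied on most
for name, importance in zip(iris.feature_names, rf_classifier.feature_importances_):
    print(f'{name}: {importance:.3f}')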

Boosting

Boosting is another ensemble method that aims to improve the performance of weak learners by sequentially training them. Unlike bagging, where models are trained independently, boosting trains models in a sequential manner, with each model trying to correct the errors of its predecessor.

One of the most popular boosting algorithms is AdaBoost (Adaptive Boosting). After each round, AdaBoost increases the weights of misclassified data points, so that subsequent models pay more attention to the instances the ensemble is still getting wrong.
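
To make the reweighting concrete, here is a minimal sketch of a single round of binary AdaBoost. It assumes labels in {-1, +1}, the function name adaboost_round is ours, and it simplifies the multi-class SAMME variant that Scikit-learn actually implements:

# Minimal sketch: one round of binary AdaBoost (labels in {-1, +1})
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_round(X, y, sample_weights):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=sample_weights)
    pred = stump.predict(X)
    # Weighted error rate of this round's stump
    err = np.sum(sample_weights[pred != y]) / np.sum(sample_weights)
    err = np.clip(err, 1e-10, 1 - 1e-10)   # guard against division by zero
    alpha = 0.5 * np.log((1 - err) / err)  # this stump's say in the ensemble
    # Misclassified points (where y * pred == -1) get exponentially larger weights
    new_weights = sample_weights * np.exp(-alpha * y * pred)
    return stump, alpha, new_weights / new_weights.sum()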

AdaBoost

Let’s see how to implement AdaBoost using Scikit-learn in Python:

# Importing the required libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Base estimator (a decision stump: a depth-1 tree)
base_estimator = DecisionTreeClassifier(max_depth=1)
# AdaBoost Classifier (the parameter is named estimator from Scikit-learn 1.2
# onward; versions before 1.2 call it base_estimator)
adaboost_classifier = AdaBoostClassifier(estimator=base_estimator, n_estimators=50, random_state=42)
adaboost_classifier.fit(X_train, y_train)
# Make predictions on the test set
y_pred = adaboost_classifier.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

In the code above, we load the Iris dataset from Scikit-learn and split it into training and testing sets. We define a weak learner as a Decision Tree with a maximum depth of 1. We then create an AdaBoost Classifier with 50 weak learners and fit it to the training data. Finally, we make predictions on the test set and calculate the accuracy of the model.
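
Because boosting is sequential, it can be instructive to watch the test accuracy evolve as estimators are added. AdaBoostClassifier provides staged_predict for exactly this; a short sketch, continuing from the code above:

# Accuracy after each boosting round (one prediction per added stump)
for i, staged_pred in enumerate(adaboost_classifier.staged_predict(X_test), start=1):
    if i % 10 == 0:
        print(f'{i} estimators: accuracy = {accuracy_score(y_test, staged_pred):.2f}')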

Stacking

Stacking, also known as Stacked Generalization, is a more advanced ensemble method that involves combining multiple base models and using a meta-model to make the final prediction. The base models are trained on the original features, and their predictions are then used as input features for the meta-model. It leverages the strengths of different base models, which can lead to improved predictive performance compared to individual models.
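
To see the mechanics, here is a minimal sketch of the two stages done by hand (the variable names are ours). It uses cross_val_predict to generate out-of-fold predictions, so the meta-model never sees predictions a base model made on its own training data:

# Minimal sketch: stacking by hand with out-of-fold predictions
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)
base_models = [RandomForestClassifier(random_state=42), LogisticRegression(max_iter=1000)]
# Stage 1: each base model's out-of-fold class probabilities become features
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method='predict_proba')
    for model in base_models
])
# Stage 2: the meta-model learns from the base models' predictions
meta_model = LogisticRegression(max_iter=1000).fit(meta_features, y)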

Scikit-learn ships a built-in StackingClassifier (in sklearn.ensemble, available since version 0.22), and third-party libraries such as mlxtend provide their own implementations. Let's see how to implement Stacking using mlxtend in Python:

# Importing the required libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
from sklearn.metrics import accuracy_score
# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Base models
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    LogisticRegression(random_state=42)
]
# Meta-model
meta_model = LogisticRegression()
# Stacking Classifier
stacking_classifier = StackingClassifier(classifiers=base_models, meta_classifier=meta_model)
stacking_classifier.fit(X_train, y_train)
# Make predictions on the test set
y_pred = stacking_classifier.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

In the code above, we load the Iris dataset from Scikit-learn and split it into training and testing sets. We define two base models (Random Forest and Logistic Regression) and a meta-model (Logistic Regression). We then create a Stacking Classifier with the base models and the meta-model and fit it to the training data. Finally, we make predictions on the test set and calculate the accuracy of the model.
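
As noted above, Scikit-learn also ships its own stacking implementation. Here is a minimal equivalent sketch using sklearn.ensemble.StackingClassifier, which by default trains the meta-model on cross-validated predictions (continuing from the train/test split above):

# Equivalent sketch with Scikit-learn's built-in StackingClassifier
from sklearn.ensemble import StackingClassifier

sk_stacking = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('lr', LogisticRegression(random_state=42, max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # the meta-model is fit on out-of-fold predictions
)
sk_stacking.fit(X_train, y_train)
print(f'Accuracy: {accuracy_score(y_test, sk_stacking.predict(X_test)):.2f}')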

