Machine Learning: A Practical Guide for Aspiring Data Scientists

Semi-Supervised Learning

Semi-Supervised Learning is a hybrid approach that combines supervised and unsupervised learning: it trains on a small amount of labeled data together with a large amount of unlabeled data to improve model performance. It is particularly useful when obtaining labeled data is expensive or time-consuming.
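Scikit-learn's semi-supervised estimators follow a simple convention: unlabeled samples are marked with the sentinel label -1 in the target vector. As a minimal sketch of how a partially labeled dataset is typically prepared (the 30% masking rate here is an arbitrary choice for illustration):

# Importing the required libraries
import numpy as np
from sklearn.datasets import make_classification
# Generate a fully labeled toy dataset
X, y = make_classification(n_samples=10, random_state=42)
# Hide roughly 30% of the labels by replacing them with the sentinel value -1
rng = np.random.RandomState(42)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.3] = -1
print(y_partial)  # entries equal to -1 are treated as unlabeled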

Label Propagation

Label Propagation is a semi-supervised learning algorithm that propagates labels from labeled data points to unlabeled data points based on their similarity. The underlying assumption is that data points in close proximity are likely to have the same label. Scikit-learn provides an implementation of the Label Propagation algorithm:

# Importing the required libraries
import numpy as np
from sklearn.datasets import make_circles
from sklearn.semi_supervised import LabelPropagation
import matplotlib.pyplot as plt
# Generate sample data
X, y = make_circles(n_samples=100, noise=0.05, random_state=42)
# Mark roughly 30% of the points as unlabeled (-1)
rng = np.random.RandomState(42)
random_unlabeled_points = rng.rand(len(y)) < 0.3
y[random_unlabeled_points] = -1
# Applying Label Propagation
label_propagation = LabelPropagation(kernel='knn', n_neighbors=10)
label_propagation.fit(X, y)
# Plotting the results, colored by the propagated labels (transduction_)
plt.scatter(X[:, 0], X[:, 1], c=label_propagation.transduction_, cmap='viridis', edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Label Propagation in Python')
plt.show()

In the code above, we create a sample dataset using Scikit-learn's make_circles function, randomly mask the labels of about 30% of the points by setting them to -1, and apply the Label Propagation algorithm to spread labels to the unlabeled points. The scatter plot shows the results, with colors representing the propagated class labels.
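Because the true labels are known in this synthetic example, we can also check how well propagation recovered the hidden labels. A minimal sketch, assuming we keep a copy of the labels before masking (y_true and the other variable names are illustrative):

# Importing the required libraries
import numpy as np
from sklearn.datasets import make_circles
from sklearn.semi_supervised import LabelPropagation
from sklearn.metrics import accuracy_score
# Regenerate the data and keep the ground-truth labels
X, y_true = make_circles(n_samples=100, noise=0.05, random_state=42)
y = y_true.copy()
rng = np.random.RandomState(42)
unlabeled = rng.rand(len(y)) < 0.3
y[unlabeled] = -1
# Fit Label Propagation and score it on the points whose labels were hidden
label_propagation = LabelPropagation(kernel='knn', n_neighbors=10)
label_propagation.fit(X, y)
print(accuracy_score(y_true[unlabeled], label_propagation.transduction_[unlabeled]))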

Self-Training

Self-Training is another semi-supervised approach: a model is first trained on the small labeled set, its highest-confidence predictions on the unlabeled data are then treated as pseudo-labels, and the newly labeled points are added to the training set for the next iteration. Let's see how to implement one round of Self-Training in Python:

# Importing the required libraries
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
import matplotlib.pyplot as plt
# Generate sample data
X, y = make_moons(n_samples=100, noise=0.1, random_state=42)
# Mark roughly half of the points as unlabeled (-1)
rng = np.random.RandomState(42)
random_unlabeled_points = rng.rand(len(y)) < 0.5
y[random_unlabeled_points] = -1
labeled_mask = y != -1
# Train the base classifier on the labeled points only
self_training_clf = SVC(kernel='linear', probability=True)
self_training_clf.fit(X[labeled_mask], y[labeled_mask])
# Pseudo-label the unlabeled points predicted with high confidence
proba = self_training_clf.predict_proba(X[~labeled_mask])
confident = proba.max(axis=1) > 0.9
X_pseudo_labeled = X[~labeled_mask][confident]
y_pseudo_labeled = self_training_clf.predict(X[~labeled_mask])[confident]
# Combining labeled and pseudo-labeled data for the next training round
X_combined = np.vstack((X[labeled_mask], X_pseudo_labeled))
y_combined = np.hstack((y[labeled_mask], y_pseudo_labeled))
# Plotting the results: remaining unlabeled points in grey,
# labeled and pseudo-labeled points colored by class
X_unlabeled = X[~labeled_mask][~confident]
plt.scatter(X_unlabeled[:, 0], X_unlabeled[:, 1], c='grey', marker='o', label='Unlabeled')
plt.scatter(X_combined[:, 0], X_combined[:, 1], c=y_combined, cmap='viridis', edgecolors='k', label='Labeled')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Self-Training in Python')
plt.legend()
plt.show()

In the code above, we create a sample dataset using Scikit-learn's make_moons function, randomly remove the labels from about half of the points, and train a Support Vector Classifier (SVC) with a linear kernel on the points that remain labeled. Unlabeled points whose class is predicted with probability above 0.9 receive pseudo-labels and are merged into the training set; in a full self-training loop this pseudo-labeling step is repeated until no confident predictions remain. The scatter plot shows the labeled and pseudo-labeled points colored by class and the still-unlabeled points in grey.
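Rather than writing this loop by hand, Scikit-learn also ships a ready-made wrapper, SelfTrainingClassifier, which repeats the pseudo-labeling step automatically until no unlabeled points clear the confidence threshold. A minimal sketch on the same data (the 0.9 threshold is an illustrative choice):

# Importing the required libraries
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier
# Recreate the partially labeled dataset
X, y = make_moons(n_samples=100, noise=0.1, random_state=42)
rng = np.random.RandomState(42)
y[rng.rand(len(y)) < 0.5] = -1
# Wrap a probabilistic base estimator; -1 entries in y are treated as unlabeled
self_training_clf = SelfTrainingClassifier(SVC(kernel='linear', probability=True), threshold=0.9)
self_training_clf.fit(X, y)
# transduction_ holds the labels assigned during self-training (-1 if never labeled)
print(self_training_clf.transduction_[:10])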

