Semi-Supervised Learning
Semi-supervised learning is a hybrid approach that combines elements of supervised and unsupervised learning. It uses a small amount of labeled data together with a large amount of unlabeled data, aiming to outperform a model trained on the labeled data alone. It is particularly useful when obtaining labeled data is expensive or time-consuming.
Label Propagation
Label Propagation is a semi-supervised learning algorithm that propagates labels from labeled data points to unlabeled data points based on their similarity. The underlying assumption is that data points in close proximity are likely to have the same label. Scikit-learn provides an implementation of the Label Propagation algorithm:
# Import the required libraries
import numpy as np
from sklearn.datasets import make_circles
from sklearn.semi_supervised import LabelPropagation
import matplotlib.pyplot as plt

# Generate sample data
X, y = make_circles(n_samples=100, noise=0.05, random_state=42)

# Hide the labels of roughly 30% of the points (-1 marks a point as unlabeled)
rng = np.random.RandomState(42)
random_unlabeled_points = rng.rand(len(y)) < 0.3
y[random_unlabeled_points] = -1

# Apply Label Propagation
label_propagation = LabelPropagation(kernel='knn', n_neighbors=10)
label_propagation.fit(X, y)

# Plot every point colored by its label after propagation
plt.scatter(X[:, 0], X[:, 1], c=label_propagation.transduction_, cmap='viridis', edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Label Propagation in Python')
plt.show()
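In the code above, we create a sample dataset using Scikit-learn's make_circles function. We hide the labels of roughly 30% of the points by setting them to -1, the value Scikit-learn uses to mark a point as unlabeled, and then apply the Label Propagation algorithm to infer labels for those points. The fitted model's transduction_ attribute holds one label per input point, and the scatter plot colors each point by that label, so different colors represent different classes.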
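Because the sample data is synthetic, the true labels of the hidden points are known, so we can check how many of them the propagation recovered. The following is a minimal sketch of that check; the variable names (y_true, y_partial, hidden) are our own, and the accuracy you see will depend on the data and on the n_neighbors setting.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation

# Same data and masking as in the example above, but keep the true labels
X, y_true = make_circles(n_samples=100, noise=0.05, random_state=42)
rng = np.random.RandomState(42)
hidden = rng.rand(len(y_true)) < 0.3

# Work on a copy so y_true stays intact for the comparison
y_partial = y_true.copy()
y_partial[hidden] = -1

model = LabelPropagation(kernel='knn', n_neighbors=10)
model.fit(X, y_partial)

# Compare the propagated labels with the labels we hid
recovered = model.transduction_[hidden]
hidden_accuracy = accuracy_score(y_true[hidden], recovered)
print(f"Hidden points: {hidden.sum()}, recovered correctly: {hidden_accuracy:.0%}")
```

Keeping a separate copy of the labels is the main design point here: the original example overwrites y in place, which leaves nothing to compare against.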
Self-Training
Self-Training is another approach in semi-supervised learning where the model starts with a small set of labeled data and iteratively labels the unlabeled data with the highest confidence predictions. The newly labeled data is then added to the training set for the next iteration. Let's see how to implement Self-Training in Python:
# Import the required libraries
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
import matplotlib.pyplot as plt

# Generate sample data
X, y = make_moons(n_samples=100, noise=0.1, random_state=42)

# Hide the labels of roughly half the points (-1 marks a point as unlabeled)
rng = np.random.RandomState(42)
random_unlabeled_points = rng.rand(len(y)) < 0.5
y_train = y.copy()
y_train[random_unlabeled_points] = -1

# Self-training: fit on the labeled points only, pseudo-label the
# confident predictions, and repeat (at most 10 rounds)
self_training_clf = SVC(kernel='linear', probability=True, random_state=42)
for _ in range(10):
    labeled = y_train != -1
    self_training_clf.fit(X[labeled], y_train[labeled])
    if labeled.all():
        break
    remaining = np.where(~labeled)[0]
    proba = self_training_clf.predict_proba(X[remaining])
    # Keep only predictions whose top class probability exceeds 0.9
    confident = proba.max(axis=1) > 0.9
    if not confident.any():
        break
    y_train[remaining[confident]] = self_training_clf.classes_[proba[confident].argmax(axis=1)]

# Plot the results
still_unlabeled = y_train == -1
plt.scatter(X[still_unlabeled, 0], X[still_unlabeled, 1], c='grey', marker='o', label='Unlabeled')
plt.scatter(X[~still_unlabeled, 0], X[~still_unlabeled, 1], c=y_train[~still_unlabeled], cmap='viridis', edgecolors='k', label='Labeled or pseudo-labeled')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Self-Training in Python')
plt.legend()
plt.show()
In the code above, we create a sample dataset using Scikit-learn's make_moons function. We hide the labels of roughly half the points by setting them to -1 and apply self-training with a Support Vector Classifier (SVC) that uses a linear kernel. The classifier is trained on the labeled points only. Unlabeled points that it predicts with a probability above 0.9 receive that prediction as a pseudo-label and join the training set for the next iteration. The scatter plot shows the labeled and pseudo-labeled points colored by class and the points that remain unlabeled in grey.
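Scikit-learn also ships a ready-made wrapper for this loop, SelfTrainingClassifier in sklearn.semi_supervised (available since version 0.24). The sketch below applies it to the same data with the same 0.9 threshold. The base classifier is passed positionally because the name of that parameter has changed between Scikit-learn releases; the variable names are our own.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Same data and masking as in the example above
X, y_true = make_moons(n_samples=100, noise=0.1, random_state=42)
rng = np.random.RandomState(42)
hidden = rng.rand(len(y_true)) < 0.5
y_partial = y_true.copy()
y_partial[hidden] = -1  # -1 marks a point as unlabeled

# The base classifier must expose predict_proba, hence probability=True
base = SVC(kernel='linear', probability=True, random_state=42)
self_training = SelfTrainingClassifier(base, threshold=0.9, max_iter=10)
self_training.fit(X, y_partial)

# labeled_iter_ is 0 for originally labeled points, a positive iteration
# number for pseudo-labeled points, and -1 for points never labeled
pseudo_labeled = self_training.labeled_iter_ > 0
print(f"Pseudo-labeled {pseudo_labeled.sum()} of {hidden.sum()} unlabeled points")

# The fitted wrapper predicts like any other classifier
predictions = self_training.predict(X)
```

Points that never reach the confidence threshold keep the value -1 in the transduction_ attribute, so a strict threshold trades coverage for fewer wrong pseudo-labels.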