Machine Learning: A Practical Guide for Aspiring Data Scientists

Unsupervised Learning

Unsupervised learning is a type of machine learning in which the model learns patterns and structure from unlabeled data, with no target variable to predict. Its goal is to discover hidden patterns or groupings in the data. Unsupervised learning can be divided into two primary categories: clustering and dimensionality reduction.

Clustering Algorithms

Clustering algorithms group data points so that points in the same cluster are more similar to each other than to points in other clusters. They are used to discover underlying patterns or to segment data into meaningful groups. Here are some popular clustering algorithms available in Python:

  • K-Means Clustering: A widely used algorithm that partitions data into k clusters by assigning each point to its nearest cluster centroid and then recomputing each centroid as the mean of its assigned points.
  • Hierarchical Clustering: Builds a tree-like structure of nested clusters by iteratively merging smaller clusters (agglomerative) or splitting larger ones (divisive).
  • Density-Based Clustering (DBSCAN): Identifies clusters as dense regions of data points separated by sparse regions, and labels points in sparse regions as noise.
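
To make the K-Means description above more concrete, here is a minimal NumPy sketch of the assign-then-update loop. The function name kmeans_sketch and its parameters are hypothetical names chosen for illustration, and the random initialization is a simplifying assumption; Scikit-learn's KMeans uses a more careful initialization and several restarts, so treat this as a teaching sketch rather than a replacement.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Illustrative K-Means loop (hypothetical helper, not Scikit-learn's implementation)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid, shape (n_samples, k)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Usage on two well-separated synthetic groups of points
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
group_b = rng.normal(loc=10.0, scale=1.0, size=(50, 2))
labels, centroids = kmeans_sketch(np.vstack([group_a, group_b]), k=2)
print(centroids)
```

The loop stops once the centroids no longer move, which is the usual convergence check for this procedure. The result can depend on the starting centroids, which is why library implementations typically run several initializations.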

Let’s implement some of these clustering algorithms using Scikit-learn in Python:

# Importing the required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
# Sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
# K-Means Clustering (random_state fixed so the result is reproducible)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
# Hierarchical Clustering (Agglomerative)
hierarchical = AgglomerativeClustering(n_clusters=4)
hierarchical_labels = hierarchical.fit_predict(X)
# Density-Based Clustering (DBSCAN)
# No cluster count is given; points in sparse regions get the noise label -1
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)
# Plotting the results: one panel per algorithm, so the plots do not cover each other
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharex=True, sharey=True)
results = [('K-Means', kmeans_labels),
           ('Hierarchical', hierarchical_labels),
           ('DBSCAN', dbscan_labels)]
for ax, (name, labels) in zip(axes, results):
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', edgecolors='k')
    ax.set_title(name)
    ax.set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
fig.suptitle('Clustering Algorithms in Python')
plt.show()

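In the code above, we create sample data using Scikit-learn’s make_blobs function, then apply K-Means, Hierarchical Clustering (Agglomerative), and DBSCAN to cluster the data points. K-Means and Agglomerative Clustering are told how many clusters to find, whereas DBSCAN infers the number of clusters from eps and min_samples and labels points in sparse regions as noise (-1). The plotting code colors each point by the cluster label that each algorithm assigned.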
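
Beyond looking at a plot, it can help to summarize the labels numerically. The sketch below reuses the same synthetic data and counts, for each algorithm, how many clusters were found and how many points were marked as noise. The models and summary dictionaries are names introduced here for illustration, and the fixed random_state for K-Means is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as in the clustering example
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

models = {
    "K-Means": KMeans(n_clusters=4, n_init=10, random_state=42),
    "Hierarchical": AgglomerativeClustering(n_clusters=4),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
}

summary = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    n_noise = int(np.sum(labels == -1))                    # only DBSCAN uses the -1 noise label
    n_clusters = int(np.unique(labels[labels != -1]).size)  # distinct labels, noise excluded
    summary[name] = (n_clusters, n_noise)
    print(f"{name}: {n_clusters} clusters, {n_noise} noise points")
```

If DBSCAN reports many noise points or an unexpected number of clusters, eps and min_samples are the parameters to adjust.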

Dimensionality Reduction

Dimensionality reduction techniques reduce the number of features in a dataset while preserving as much of the relevant information as possible. This is useful for visualizing and working with high-dimensional data. One of the most common techniques is Principal Component Analysis (PCA), which projects the data onto the directions of greatest variance. Let’s see how to implement PCA in Python:

# Importing the required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
# Load digits dataset
digits = load_digits()
X, y = digits.data, digits.target
# Applying PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Plotting the results
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', edgecolors='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA in Python')
plt.show()

In the code above, we use the digits dataset from Scikit-learn, which contains 8×8 images of handwritten digits, each flattened into a 64-feature vector. We apply PCA to reduce these 64 features to two principal components and plot the data points in a 2D space, colored by digit label.
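
To show roughly what PCA does under the hood, the sketch below reproduces the two-component projection with a plain NumPy SVD and compares it with Scikit-learn's result. It also computes the fraction of total variance each component captures, which indicates how much information the 2D view retains. The variable names X_manual and explained are introduced here for illustration, and svd_solver="full" is chosen as an assumption to keep the comparison exact.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # 64 features per image

# Manual PCA: center the data, then project onto the top right-singular vectors
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_manual = X_centered @ Vt[:2].T

# Fraction of the total variance captured by each component
explained = (S ** 2) / np.sum(S ** 2)

pca = PCA(n_components=2, svd_solver="full")
X_sklearn = pca.fit_transform(X)

# Each component is only defined up to sign, so compare absolute values
print("Projections match:", np.allclose(np.abs(X_manual), np.abs(X_sklearn), atol=1e-6))
print("Variance ratio (manual): ", explained[:2])
print("Variance ratio (sklearn):", pca.explained_variance_ratio_)
```

The sum of the first two variance ratios tells you how much of the dataset's variance the 2D scatter plot actually represents; the rest is discarded by the projection.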

