All these tutorial are written by me as a freelancing working for tutorial project AlgoDaily. These has been slightly changed and more lessons after lesson 12 has been added to the actual website. Thanks to Jacob, the owner of AlgoDaily, for letting me author such a wonderful Machine Learning tutorial series. You can sign up there and get a lot of resources related to technical interview preparation.

Machine Learning Made Easy: Scikit-Learn (Part-2: Unsupervised Learning)

Introduction

Previously, we have understood the core concepts of supervised machine learning and how to use most of the common implementations of many preprocessing techniques, models, etc. with scikit-learn. In this lesson, we will be covering most approaches and techniques used for unsupervised learning using the power of scikit-learn.

The Basics

As a recap, Scikit-learn tries to divide everything related to Machine Learning into the following categories.

Supervised Learning:
a. Classification Problem
b. Regression Problem
Unsupervised Learning
a. Clustering
b. Dimensionality Reduction
c. Density Estimation

Based on these classifications, sklearn has tried a uniform approach for all the data, transformation, preprocessing, and models. In the previous lesson, we have gone through the supervised learning part in this category. In this lesson, we will have a look at unsupervised learning. Let us discuss each of these one by one.

What is unsupervised learning?

According to Wikipedia,

Unsupervised learning (UL) is a type of algorithm that learns patterns from untagged data.

So you will have data, where you do not have any target label to train on. For example, suppose you have 1000 images of cats and dogs, but the images are not labeled as which ones are cats and which ones are dogs. If you then try to create a partition and separate two animals using machine learning methods, that will be called unsupervised learning.

Most people think that unsupervised learning is a little harder than supervised learning. Because it is difficult to determine the loss or goodness of a model as there is no label to compare with. For clustering a dataset, you can use the distance between the mean of two clusters or inverse of distance among points in similar clusters as a goodness of the model. In this lesson, we will go through data clustering, outlier detection, data augmentation, synthesis, and many other unsupervised learning applications.

Clustering a dataset

Clustering is a very if not most popular algorithm definition for unsupervised learning. Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. And each of these groups is called a cluster. For example, suppose you are at a party where everybody is wearing dresses of different colors. Now you are playing a game, where everyone with the same color dress, will come along together. So finally all people wearing red will be together, all persons wearing blue will stand together and same for the other colors. This is called clustering based on the color of each person’s dress.

In the example above, the whole game can a called a clustering algorithm. Here, the similarity between two persons (or objects) is the color. So color is the clustering feature. The difference between the color of two dresses is the distance between two objects while clustering. We could use other clustering features along with the color, for example where each person lives.

There are many clustering algorithms to use when clustering a dataset. We will start with the most common ones and show you how simple it is to implement one using scikit-learn.

Types of clustering

Previously, we have learned everything about K-means clustering. But there are hundreds of more clustering algorithms to use depending on your needs. From the top view, there are two types of clustering that you need to know.

Divisive Clustering: In this algorithm, we start from only one cluster. At each iteration, we divide clusters to make more clusters. For example, let’s say we have a dataset of 4 points ${A, B, C, D}$. We first divide that dataset into two clusters, ${{A,B}, {C, B}}$. Later, if needed
Agglomerative Clustering: In agglomerative clustering, we start by assuming each point as its own cluster. So if we have $n$ sample points, then we will start with $n$ clusters. At each iteration, we will merge the two closest clusters into one. Keep in mind that the definition of closest cluster may vary from application to application. But it must follow the Pythagorean distance formula. After some iterations, we will end up in only one cluster.

K-Means Clustering

Mathematically, given a set of observations $(x_1, x_2, …, x_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the $n$ observations into $k (≤ n)$ sets $S = {S_1, S_2, …, S_k}$ so as to minimize the within-cluster sum of squares (i.e. variance). Formally, the objective is to find:

$$arg\ min_s \sum_{i=1}^{k}{\sum_{x \in S_i}{||x - \mu_i||^2}} = arg\ min_s \sum_{i=1}^{k}{|S_i|}\ Var\ S_i$$

where $\mu_i$ is the mean of points in $S_i$. This is equivalent to minimizing the pairwise squared deviations of points in the same cluster:

$$arg\ min_s \sum_{i = 1}^{k}{\frac{1}{2 |S_i|}} \sum_{x,y \in S_i}{||x-y||^2}$$

The equivalence can be deduced from identity:

$$\sum_{x \in S_i}{||x-\mu_i||^2} = \sum_{x \neq y \in S_i}{(x-\mu_i)^T (\mu_i - y)}$$

Because the total variance is constant, this is equivalent to maximizing the sum of squared deviations between points in different clusters (between-cluster sum of squares, BCSS), which follows from the law of total variance.

If you are like me and do not understand anything written above, just see that last equation again and think of our objective to minimize the term with respect to $S$.

The unsupervised k-means algorithm has a loose relationship to the k-nearest neighbor classifier, a popular supervised machine learning technique for classification that is often confused with k-means due to the name.

Here, the term “k-means” means that we will have a total of $k$ points in the feature space. These k points are known as centroids of each cluster. Let’s first implement a K-means clustering from scratch. Our steps will be:

1
2
3
4
5
1. Randomly pick k data points as our initial Centroids.
2. Find the distance (Euclidean distance for our purpose) between each data points in our training set with the k centroids.
3. Now assign each data point to the closest centroid according to the distance found.
4. Update centroid location by taking the average of the points in each cluster group.
5. Repeat Steps 2 to 4 till our centroids don’t change.

K-Means from scratch

First, we will need a dataset. We will randomly generate a dataset for clustering. Luckily, sklearn gifts us a function named make_blob in its datasets module. This function generates isotropic Gaussian blobs for clustering experimentation.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
import numpy as np
from sklearn.datasets import make_blobs

n_samples = 1500
random_state = 170
X, y = make_blobs(n_samples=n_samples, random_state=random_state)

plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1])
plt.title("Unclustered")

This will give us a plot of the unclustered dataset retrieved from make_blobs. Visually you can see there are 3 clusters in the dataset.

unclustered

To cluster this dataset with K-means, we first need to randomly initialize 3 centroid points within the 2D space. We can do this by the following code:

1
2
#Randomly choosing Centroids
centroids = x[np.random.choice(len(x), k)]

After defining the centroids, we have to calculate the distance of all the points to these centroids. We could use a loop over all the points and calculate the distance manually like below:

1
2
3
4
5
6
distances = []
for point in X:
    dist_tmp = []
    for centroid in centroids:
        dist_tmp.append(sklearn.metrics.pairwise.euclidean_distances(point, centroids))
    distances.append(dist_tmp)

But luckily, sklearn provides a convenient function that does this whole thing in a one-line function call.

1
2
3
#finding the distance between centroids and all the data points
from scipy.spatial.distance import cdist 
distances = cdist(x, centroids ,'euclidean')

Now that we have the distances of each point to all the centroids, we will mark each point as a member of that cluster, whose centroid is the closest. For example, a point has a distance like [5.5, 2.8, 8.3], so it is 5.5 away from centroid 0, 2.8 away from centroid 1, and 8.3 away from centroid 2. If we mark this based on the closest centroid, then the point should be marked as 1. So we can use the np.argmin function to get the index of the minimum distance and use it as centroid index.

1
2
#Centroid with the minimum Distance
points = np.array([np.argmin(i) for i in distances])

Now all the points are marked into a cluster. But this is just the first iteration (and initialization) of the K-means clustering algorithm. We have to do this for several iterations (say n_iterations).

1
2
3
4
5
6
7
8
9
for _ in range(n_iteration): 
    centroids = []
    for idx in range(k):
        temp_cent = x[points==idx].mean(axis=0) 
        centroids.append(temp_cent)

    centroids = np.vstack(centroids) # Updated Centroids
    distances = cdist(x, centroids ,'euclidean')
    points = np.array([np.argmin(i) for i in distances])

If we wrap the whole thing inside a function, it should look something like below. Also, this will work with any dimension of the points.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
def kmeans(x, k, n_iteration):
    centroids = x[np.random.choice(len(x), k)]
    distances = cdist(x, centroids ,'euclidean')
    points = np.array([np.argmin(i) for i in distances])

    for _ in range(n_iteration): 
        centroids = []
        for idx in range(k):
            temp_cent = x[points==idx].mean(axis=0) 
            centroids.append(temp_cent)

        centroids = np.vstack(centroids) # Updated Centroids
        distances = cdist(x, centroids ,'euclidean')
        points = np.array([np.argmin(i) for i in distances])
        
    return points

After applying this function, we can see the result very impressive for our dummy blob dataset.

1
2
3
4
5
y_pred = kmeans(X, 3, 100)

plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Our K-means implementation")

out impl

K-means with sklearn

All the things we did above can be done with sklearn far more easily. We instantiate a KMeans algorithm object/model and run it over our blob dataset.

1
2
3
4
5
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X)

plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("K-Means sklearn")

The result is quite similar.

k-means sklearn

But for real data, we might even not know the potential number of clusters. There are many ways to also learn the number of clusters. We can choose the optimal value of K (Number of Clusters) using methods like the The Elbow method. Scikit-learn also has an implementation of this algorithm.

DBSCAN clustering

The full form of DBSCAN is “Density-Based Spatial Clustering of Applications with Noise”. So it is a density-based clustering non-parametric algorithm. Given a set of sample points, the algorithm groups together points that are closely packed together (points with many nearby neighbors), marking them as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and is also most cited in scientific literature.

The benefit of the DBSCAN algorithm is that it can identify the shape of a cluster and mark all points of that shape into a single cluster. The is not possible in the case of K-Means clustering algorithm. See the image below to understand the strength of DBSCAN.

kmeans-dbscan

As you can see, if the pattern of the sample points is a little complex, you should go for DBSCAN instead of K-Means clustering. Let us see how you can implement DBSCAN. You can see a more detailed example of this here.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
from sklearn.cluster import DBSCAN
n_samples = 1500
X, y = make_blobs(n_samples=n_samples)
# Try to modify eps according to your needs
db = DBSCAN(eps=0.8, min_samples=2).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
# Black removed and is used for noise instead.
unique_labels = set(db.labels_)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (db.labels_ == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('DBSCAN clustering')
plt.show()

The output will look something like the below:

DBSCAN

Other Clustering Algorithms

I am just showing you only 2 popular clustering algorithms. But there are a lot more. Each algorithm has its strength and weakness. A comparison and implementation of 10 popular clustering algorithms are shown in the below code. The clustering algorithms used for this code is the following:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
print(__doc__)

import time
import warnings

import numpy as np
import matplotlib.pyplot as plt

from sklearn import cluster, datasets, mixture
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler
from itertools import cycle, islice

np.random.seed(0)

# ============
# Generate datasets. We choose the size big enough to see the scalability
# of the algorithms, but not too big to avoid too long running times
# ============
n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
                                      noise=.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None

# Anisotropicly distributed data
random_state = 170
X, y = datasets.make_blobs(n_samples=n_samples, random_state=random_state)
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X_aniso = np.dot(X, transformation)
aniso = (X_aniso, y)

# blobs with varied variances
varied = datasets.make_blobs(n_samples=n_samples,
                             cluster_std=[1.0, 2.5, 0.5],
                             random_state=random_state)

# ============
# Set up cluster parameters
# ============
plt.figure(figsize=(9 * 2 + 3, 13))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.95, wspace=.05,
                    hspace=.01)

plot_num = 1

default_base = {'quantile': .3,
                'eps': .3,
                'damping': .9,
                'preference': -200,
                'n_neighbors': 10,
                'n_clusters': 3,
                'min_samples': 20,
                'xi': 0.05,
                'min_cluster_size': 0.1}

datasets = [
    (noisy_circles, {'damping': .77, 'preference': -240,
                     'quantile': .2, 'n_clusters': 2,
                     'min_samples': 20, 'xi': 0.25}),
    (noisy_moons, {'damping': .75, 'preference': -220, 'n_clusters': 2}),
    (varied, {'eps': .18, 'n_neighbors': 2,
              'min_samples': 5, 'xi': 0.035, 'min_cluster_size': .2}),
    (aniso, {'eps': .15, 'n_neighbors': 2,
             'min_samples': 20, 'xi': 0.1, 'min_cluster_size': .2}),
    (blobs, {}),
    (no_structure, {})]

for i_dataset, (dataset, algo_params) in enumerate(datasets):
    # update parameters with dataset-specific values
    params = default_base.copy()
    params.update(algo_params)

    X, y = dataset

    # normalize dataset for easier parameter selection
    X = StandardScaler().fit_transform(X)

    # estimate bandwidth for mean shift
    bandwidth = cluster.estimate_bandwidth(X, quantile=params['quantile'])

    # connectivity matrix for structured Ward
    connectivity = kneighbors_graph(
        X, n_neighbors=params['n_neighbors'], include_self=False)
    # make connectivity symmetric
    connectivity = 0.5 * (connectivity + connectivity.T)

    # ============
    # Create cluster objects
    # ============
    ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
    two_means = cluster.MiniBatchKMeans(n_clusters=params['n_clusters'])
    ward = cluster.AgglomerativeClustering(
        n_clusters=params['n_clusters'], linkage='ward',
        connectivity=connectivity)
    spectral = cluster.SpectralClustering(
        n_clusters=params['n_clusters'], eigen_solver='arpack',
        affinity="nearest_neighbors")
    dbscan = cluster.DBSCAN(eps=params['eps'])
    optics = cluster.OPTICS(min_samples=params['min_samples'],
                            xi=params['xi'],
                            min_cluster_size=params['min_cluster_size'])
    affinity_propagation = cluster.AffinityPropagation(
        damping=params['damping'], preference=params['preference'])
    average_linkage = cluster.AgglomerativeClustering(
        linkage="average", affinity="cityblock",
        n_clusters=params['n_clusters'], connectivity=connectivity)
    birch = cluster.Birch(n_clusters=params['n_clusters'])
    gmm = mixture.GaussianMixture(
        n_components=params['n_clusters'], covariance_type='full')

    clustering_algorithms = (
        ('MiniBatch\nKMeans', two_means),
        ('Affinity\nPropagation', affinity_propagation),
        ('MeanShift', ms),
        ('Spectral\nClustering', spectral),
        ('Ward', ward),
        ('Agglomerative\nClustering', average_linkage),
        ('DBSCAN', dbscan),
        ('OPTICS', optics),
        ('BIRCH', birch),
        ('Gaussian\nMixture', gmm)
    )

    for name, algorithm in clustering_algorithms:
        t0 = time.time()

        # catch warnings related to kneighbors_graph
        with warnings.catch_warnings():
            warnings.filterwarnings(
                "ignore",
                message="the number of connected components of the " +
                "connectivity matrix is [0-9]{1,2}" +
                " > 1. Completing it to avoid stopping the tree early.",
                category=UserWarning)
            warnings.filterwarnings(
                "ignore",
                message="Graph is not fully connected, spectral embedding" +
                " may not work as expected.",
                category=UserWarning)
            algorithm.fit(X)

        t1 = time.time()
        if hasattr(algorithm, 'labels_'):
            y_pred = algorithm.labels_.astype(int)
        else:
            y_pred = algorithm.predict(X)

        plt.subplot(len(datasets), len(clustering_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)

        colors = np.array(list(islice(cycle(['#377eb8', '#ff7f00', '#4daf4a',
                                             '#f781bf', '#a65628', '#984ea3',
                                             '#999999', '#e41a1c', '#dede00']),
                                      int(max(y_pred) + 1))))
        # add black color for outliers (if any)
        colors = np.append(colors, ["#000000"])
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[y_pred])

        plt.xlim(-2.5, 2.5)
        plt.ylim(-2.5, 2.5)
        plt.xticks(())
        plt.yticks(())
        plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
        plot_num += 1

plt.show()

The resulting image is like below:

sklearn-clustering

Dimensionality Reduction

Before understanding the importance of data visualization, let us see 2 scenarios first:

Suppose you have a dataset with more than 3 attributes and you want to visualize them. How can you do that in a 2D or 3D space? The most obvious answer is to select only 2/3 attributes and visualize using them. You can also create multiple plots each representing 2/3 attributes of the dataset.
Suppose you have a very dataset with millions of attributes and you want to train a large model (say a neural network) using that data. You find that the training and prediction of the model are very slow and won’t fit your needs. So you can remove some attributes from the dataset and then use them for the model. Now, which attributes will you remove first? What if some of the attributes are more important than others? How do you determine that?

In these scenarios, a dimension reduction will help you a lot. You can create a new dataset from these where the number of attributes is less. You can also get the importance of each attribute in your dataset. Here the importance is derived from how much the dataset is diverse based on that attribute.

There are a lot of algorithms for Dimensionality Reduction. We will discuss only two of these methods: PCA and t-SNE.

Principal Components Analysis

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set. The main objective of PCA is to reduce the number of variables of a data set while preserving as much information as possible. The steps of PCA is as follows:

Standardize the Dataset
Compute the Covariance Matrix
Compute Eigen Values and Eigen Vectors from Covariance Matrix
Sort the attributes according to their respecting Eigen Values and select the first required number of columns.

After standardizing the dataset, all the variables will be transformed to the same scale. So the range of an attribute will not apply any effect on PCA. The covariance matrix is a $nxn$ symmetric matrix (where $n$ is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a $3x3$ matrix of this from:

$$\begin{bmatrix}
Cov(x,x) & Cov(x,y) & Cov(x,z)\
Cov(y,x) & Cov(y,y) & Cov(y,z)\
Cov(z,x) & Cov(z,y) & Cov(z,z)
\end{bmatrix}$$

After getting the covariance matrix, we calculate the eigenvalues and eigenvectors. We are going to have $n$ eigenvalues $\lambda_1, \lambda_2 … \lambda_n$. They are obtained by solving the equation given in the expression below:

$$|A - \lambda I_n| = 0$$

Here $I_n$ is an identity matrix of shape $nxn$. The corresponding eigenvectors $e_1, e_2… e_n$ are obtained by solving the expression below:

$$(A - \lambda_j I_n) e_j = 0$$

Here, we have the difference between the matrix $A$ minus the $j^{th}$ eigenvalue times the Identity matrix, this quantity is then multiplied by the $j^{th}$ eigenvector and set it all equal to zero. This will obtain the eigenvector $e_j$ associated with eigenvalue $\mu_j$.

PCA from scratch

Luckily, NumPy has all the functions to calculate these in one line. Let us implement this in code.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
# mean Centering the data  
X_meaned = X - np.mean(X , axis = 0)

# calculating the covariance matrix of the mean-centered data.
cov_mat = np.cov(X_meaned , rowvar = False)

#Calculating Eigenvalues and Eigenvectors of the covariance matrix
eigen_values , eigen_vectors = np.linalg.eigh(cov_mat)

#sort the eigenvalues in descending order
sorted_index = np.argsort(eigen_values)[::-1]
sorted_eigenvalue = eigen_values[sorted_index]
sorted_eigenvectors = eigen_vectors[:,sorted_index]

After that, we can create a subset of our attributes and eigenvectors. We can also transform our original data using the eigenvectors.

1
2
3
4
5
6
7
# select the first n eigenvectors, n is desired dimension
# of our final reduced data.
eigenvector_subset = sorted_eigenvectors[:,0:n_components]

# We can also create the reduced dataset that has lower dimension and represents the actual
# dataset as much as possible
X_reduced = np.dot(eigenvector_subset.transpose() , X_meaned.transpose() ).transpose()

If we wrap the whole process into a function, the whole code should look something like this. We are applying the algorithm on our well-known Iris dataset taken from sklearn.dataset submodule. Iris has 4 dimensions, we will reduce it to 2 dimensions for visualization in a 2D plot.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
def PCA(X , n_components):
    
    # mean Centering the data  
    X_meaned = X - np.mean(X , axis = 0)

    # calculating the covariance matrix of the mean-centered data.
    cov_mat = np.cov(X_meaned , rowvar = False)

    #Calculating Eigenvalues and Eigenvectors of the covariance matrix
    eigen_values , eigen_vectors = np.linalg.eigh(cov_mat)

    #sort the eigenvalues in descending order
    sorted_index = np.argsort(eigen_values)[::-1]
    sorted_eigenvalue = eigen_values[sorted_index]
    sorted_eigenvectors = eigen_vectors[:,sorted_index]

    # select the first n eigenvectors, n is desired dimension
    # of our final reduced data.
    eigenvector_subset = sorted_eigenvectors[:,0:n_components]

    # We can also create the reduced dataset that has lower dimension and represents the actual
    # dataset as much as possible
    X_reduced = np.dot(eigenvector_subset.transpose() , X_meaned.transpose() ).transpose()
    
    return X_reduced



iris = datasets.load_iris()
X = iris.data
y = iris.target
n_components = 2
X_reduced = PCA(X, n_components)

# For visualization
principal_df = pd.DataFrame(X_reduced , columns = ['PC1','PC2'])

# Using seaborn for cute display
sb.scatterplot(data = principal_df , x = 'PC1',y = 'PC2' , hue = y , s = 60)

The result is quite impressive as we can see three types of plants being partitions into their own spaces. This is the power of PCA.

PCA

PCA using sklearn

Now lets see the same in scikit-learn.

1
2
from sklearn.decomposition import PCA
X_reduced = PCA(n_components=2).fit(X).transform(X)

DONE. It is just that much simple.

t-SNE

When it comes to Dimensionality Reduction, PCA is famous because it is simple, fast, easy to use, and retains the overall variance of the dataset. Though PCA is great, it does have some drawbacks. One of the major drawbacks of PCA is that it does not retain non-linear variance. This means PCA will not be able to get good results for the dataset when you want to preserve the overall structure of the dataset.

In PCA, the relative distance between two sample points is not maintained, thus the dataset structure is deformed. But using t-SNE, you can reduce the dimension of the dataset preserving the relative distance between each sample point.

t-SNE is a nonlinear dimensionality reduction technique that is well suited for embedding high dimension data into lower dimensional data (2D or 3D) for data visualization using t-Distribution. t-SNE stands for t-distributed Stochastic Neighbor Embedding. Let us break each letter and try to understand them.

Stochastic: Not definite but random probability
Neighbor: Concerned only about retaining the variance of neighbor points
Embedding: Plotting data into lower dimensions

To apply t-SNE, we need to follow these steps:

Construct a probability distribution on pairs in higher dimensions such that similar objects are assigned a higher probability and dissimilar objects are assigned a lower probability.
Replicate the same probability distribution on lower dimensions iteratively till the Kullback-Leibler divergence is minimized.

We will not implement any more algorithms from scratch and make this lesson unnecessarily long. So let use t-SNE using scikit-learn.

1
2
3
from sklearn.manifold import TSNE
X_reduced = TSNE(n_components=2).fit_transform(X)
# Visualize

The visualization part is the same as above. The output of the algorithm on the Iris dataset is given below:

t-sne

As you can see, the two labels are very close to each other and hard to distinguish. The later label can be very easily separated from the dataset. You do not get this distance between each sample using the PCA algorithm.

One important thing about t-SNE is its perplexity. This single hyperparameter can change the whole transformation of the dataset. The larger the perplexity, the more non-local information will be retained in the dimensionality reduction result.

Let us see the effect of different perplexity values on the IRIS dataset.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
import numpy as np
import matplotlib.pyplot as plt

from matplotlib.ticker import NullFormatter
from sklearn import manifold, datasets
from time import time

n_samples = X.shape[0]
n_components = 2
(fig, subplots) = plt.subplots(1, 5, figsize=(20, 4))
perplexities = [1, 5, 30, 50, 100]

iris = datasets.load_iris()
X = iris.data
y = iris.target

red = y == 0
green = y == 1
blue = y == 2

for i, perplexity in enumerate(perplexities):
    ax = subplots[i]

    t0 = time()
    tsne = manifold.TSNE(n_components=n_components, init='random',
                         random_state=0, perplexity=perplexity)
    Y = tsne.fit_transform(X)
    t1 = time()
    print("circles, perplexity=%d in %.2g sec" % (perplexity, t1 - t0))
    ax.set_title("Perplexity=%d" % perplexity)
    ax.scatter(Y[red, 0], Y[red, 1], c="r")
    ax.scatter(Y[green, 0], Y[green, 1], c="g")
    ax.scatter(Y[blue, 0], Y[blue, 1], c="b")
    ax.xaxis.set_major_formatter(NullFormatter())
    ax.yaxis.set_major_formatter(NullFormatter())
    ax.axis('tight')

plt.show()

The result is given below:

tsne-perplexity

As you can see, the perfect perplexity stays between 30 to 50. Less perplexity pays more attention to the global information and increases the distance of some points from all other points too high. On the other hand, large perplexity pays a lot of attention to local information and tries to keep the distance of points in the same cluster higher than the intercluster distance.

There is also a very popular dimension reduction method. It is based on deep neural networks and is known as AutoEncoders. We will learn this in a later lesson after we cover the basics of Neural Networks.

Some words about Reinforcement Learning

Reinforcement Learning is another type of Machine Learning technique that is neither Supervised Learning nor Unsupervised Learning. Reinforcement learning is the training of machine learning models to make a sequence of decisions. Via this algorithm computers learn to achieve a goal in an uncertain, potentially complex environment. In reinforcement learning, artificial intelligence faces a game-like situation.

Suppose you are in a game environment. At first, you do not even know your goal. You do not what to do, and how to complete the game. Slowly you move forward and see a coin standing in a corner. You proceed to the coin and take it. The moment you pick up the coin, you see your score going up. You know that a higher score is always better. So you decide to take as much coin as possible throughout the game.

Again you see a thorn mounted on a wall. You try to touch that thorn and suddenly see your health go down. Eventually, if you keep touching that, your health will deplete and you will die. After understanding this, you decide to stay away from those for the entire game.

A virtual agent in the algorithm is just like you. He gets a reward in the game when moved towards a goal, and gets punished when moves away from the goal. There is a lot of live competition happening for reinforcement learning. For example, there are Halite.io, crowdanalytix, codalab, ods.ai etc.

Solving different dataset problems with different approaches

As of the official guide of Scikit-learn, often the hardest part of solving a machine learning problem can be finding the right estimator for the job. Different estimators are better suited for different types of data and different problems.

The flowchart below is designed to give users a bit of a rough guide on how to approach problems concerning which estimators to try on your data.

choosing estimator

Conclusion

This lesson is the wrapping up of the Scikit-learn framework in Machine Learning. Now you can be confident that you can get started with both supervised and unsupervised learning algorithms. In a later lesson, when we will learn about deep learning algorithms, you will understand that the basic ideas of these two lessons on scikit-learn still apply. Apply some algorithms to your own needs and share that awesomeness in the discussion section.

Machine Learning Tutorial - Lesson 07.5

Lesson 07.5 out of 12

Machine Learning Made Easy: Scikit-Learn (Part-2: Unsupervised Learning)

Introduction

The Basics

What is unsupervised learning?

Clustering a dataset

Types of clustering

K-Means Clustering

K-Means from scratch

K-means with sklearn

DBSCAN clustering

Other Clustering Algorithms

Dimensionality Reduction

Principal Components Analysis

PCA from scratch

PCA using sklearn

t-SNE

Some words about Reinforcement Learning

Solving different dataset problems with different approaches

Conclusion

Machine Learning Tutorial - Lesson 07.5

Lesson 07.5 out of 12

Machine Learning Made Easy: Scikit-Learn (Part-2: Unsupervised Learning)

Introduction

The Basics

What is unsupervised learning?

Clustering a dataset

Types of clustering

K-Means Clustering

K-Means from scratch

K-means with sklearn

DBSCAN clustering

Other Clustering Algorithms

Dimensionality Reduction

Principal Components Analysis

PCA from scratch

PCA using sklearn

t-SNE

Some words about Reinforcement Learning

Solving different dataset problems with different approaches

Conclusion

See Also