4 Easy Steps to Understand Unsupervised Machine Learning with an Example in Python

Step 1: Learn what unsupervised machine learning is

An unsupervised machine learning model takes unlabelled (uncategorised) data and lets the algorithm determine the answer for us.

Unsupervised Machine Learning model - takes unstructured data and finds patterns itself

The unsupervised machine learning model takes data without apparent structure and tries to identify patterns itself in order to create categories.

Step 2: Understand the main types of unsupervised machine learning

There are two main types of unsupervised machine learning.

  • Clustering: Used for grouping data into categories without knowing any labels beforehand.
  • Association: A rule-based method for discovering interesting relations between variables in large databases.

In clustering, the main algorithms used are K-means, hierarchical clustering, and hidden Markov models.

In association, the main algorithms used are Apriori and FP-growth.
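
To make the idea of an association rule a little more concrete, here is a minimal sketch in plain Python. It does not implement Apriori or FP-growth; it only computes the two measures such algorithms typically build on, support and confidence, for a single rule. The shopping baskets and the helper names support and confidence are made up for this illustration.

```python
# Hypothetical shopping baskets, invented for this illustration
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(itemset <= basket for basket in transactions)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """How often the consequent appears in baskets that contain the antecedent."""
    return support(antecedent | consequent) / support(antecedent)


# Evaluate the rule "bread -> butter"
rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

In this toy data, two of the four baskets contain both items (support 0.5), and two of the three baskets with bread also contain butter (confidence of about 0.67).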

Step 3: How K-means works

K-means works in iterative steps.

Finding the optimal k-means clustering is an NP-hard problem, which means there is no known efficient way to solve it in the general case. There are, however, heuristic algorithms that converge quickly to a local optimum: you find a good solution fast, though it might not be the best one. In practice they often do just fine.

Enough theory.

How does the algorithm work?

  • Step 1: Start with a set of k means. These can be chosen by taking k random points from the dataset (called the Forgy initialisation method).
  • Step 2: Group each data point into the cluster of the nearest mean. Hence, each data point will be assigned to exactly one cluster.
  • Step 3: Recalculate the means (also called centroids) as the average of the points assigned to each cluster, which moves them towards a local optimum.

Steps 2 and 3 are repeated until the grouping in Step 2 does not change any more.
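
The three steps above can be sketched in plain NumPy as follows. This is an illustration under assumptions, not how scikit-learn implements it: the function name kmeans_sketch, the seed and max_iter parameters, and the sample points are chosen here for the example, and an empty cluster simply keeps its previous mean.

```python
import numpy as np


def kmeans_sketch(points, k, seed=0, max_iter=100):
    """Illustrative k-means loop; returns (labels, means)."""
    rng = np.random.default_rng(seed)

    # Step 1: pick k distinct data points as the initial means
    means = points[rng.choice(len(points), size=k, replace=False)]

    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest mean
        distances = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)

        # Stop when the grouping no longer changes
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 3: move each mean to the average of its assigned points
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old mean
                means[j] = members.mean(axis=0)

    return labels, means


# A small illustrative set of 2D points
points = np.array([[1, 2], [2, 4], [0.3, 2.5], [9.2, 8.5],
                   [2.4, 0.3], [9, 11], [12, 10]])
labels, means = kmeans_sketch(points, k=2)
print(labels)
print(means)
```

Which cluster ends up with label 0 and which with label 1 depends on the random starting points, so only the grouping itself is meaningful.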

Step 4: A simple Python example with the k-means algorithm

In this example we assume you have basic knowledge of how to install the needed libraries. If not, then see the following article.

First off, you need to import the needed libraries.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

In the first basic example we are only going to plot some points on a graph.

style.use('ggplot')

x = [1, 2, 0.3, 9.2, 2.4,  9, 12]
y = [2, 4, 2.5, 8.5, 0.3, 11, 10]
plt.scatter(x, y)
plt.show()

The first line sets the style of the graph. Then we have the coordinates in the lists x and y, which is the format that scatter expects.

Output of the plot from scatter plotter in Python.

An advantage of plotting the points first is that it helps you figure out how many clusters to use. Here it looks like there are two “groups” of points, which translates into using two clusters.

To continue, we want to use the k-means algorithm with two clusters.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

style.use('ggplot')

x = [1, 2, 0.3, 9.2, 2.4,  9, 12]
y = [2, 4, 2.5, 8.5, 0.3, 11, 10]

# We need to transform the input coordinates into the (n_samples, 2) array that the k-means algorithm expects
X = np.column_stack((x, y))

# The number of clusters
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
labels = kmeans.labels_

# Then we want to have different colors for each type.
colors = ['g.', 'r.']
for i in range(len(X)):
    # And plot them one at a time
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)

# Plot the centres (or means); the fitted model exposes them as cluster_centers_
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()

This produces the following plot.

Example of the k-means algorithm used on a simple dataset
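
Once the model is fitted, it can also be inspected and used without plotting. The sketch below assumes the same seven points; n_init, random_state and the new point (10, 9) are choices made here for reproducibility and illustration, not part of the example above.

```python
import numpy as np
from sklearn.cluster import KMeans

x = [1, 2, 0.3, 9.2, 2.4, 9, 12]
y = [2, 4, 2.5, 8.5, 0.3, 11, 10]
X = np.column_stack((x, y))

# Fixing random_state makes repeated runs give the same labels (an assumption for this sketch)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# One label per input point, in the same order as X
print(kmeans.labels_)

# One row per cluster: the coordinates of each centroid
print(kmeans.cluster_centers_)

# Assign a previously unseen, hypothetical point to the nearest centroid
new_point = np.array([[10.0, 9.0]])
predicted = kmeans.predict(new_point)
print(predicted)
```

The label numbers themselves are arbitrary; what matters is which points share a label.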

Considerations when using K-Means algorithm

We could have used 3 clusters instead, by setting n_clusters=3 and adding a third entry to colors (for example 'b.'). That would have resulted in the following output.

Using 3 clusters instead of two in the k-means algorithm

Three clusters is not optimal for this dataset, but that could be hard to predict without a visual representation of the data.
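
A rough numeric complement to the visual check is to compare how tightly the points sit around their centroids for different values of k. The sketch below uses scikit-learn's inertia_ attribute (the sum of squared distances from each point to its closest centroid); the range of k values, n_init and random_state are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

x = [1, 2, 0.3, 9.2, 2.4, 9, 12]
y = [2, 4, 2.5, 8.5, 0.3, 11, 10]
X = np.column_stack((x, y))

# Fit one model per candidate k and record its inertia
inertias = {}
for k in (1, 2, 3, 4):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

Inertia generally keeps shrinking as k grows, so the lowest value is not the goal. What to look for is where the improvement levels off; on this dataset the drop from one to two clusters is far larger than the drop from two to three.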

Uses of K-Means algorithm

Here are some interesting uses of the K-means algorithm:

  • Personalised marketing to users
  • Identifying fake news
  • Spam filter in your inbox
