Step 1: Learn what unsupervised machine learning is
An unsupervised machine learning model takes unlabelled (uncategorised) data and lets the algorithm determine the answer for us.
The model takes data without any apparent structure and tries to identify patterns in it on its own in order to create categories.
Step 2: Understand the main types of unsupervised machine learning
There are two main types of unsupervised machine learning.
- Clustering: Used for grouping data into categories without knowing any labels beforehand.
- Association: A rule-based method for discovering interesting relations between variables in large databases.
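To make the idea of association a bit more concrete, here is a minimal sketch in plain Python. It is not from the original article: the shopping-basket transactions are made up, and the two measures it computes (usually called support and confidence) are just one common way to score a rule such as "bread implies milk".

```python
# Hypothetical toy transactions, made up for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"bread", "milk"}))        # share of baskets with both items
print(confidence({"bread"}, {"milk"}))   # how often milk appears when bread does
```

Here bread and milk appear together in two of the four baskets, and milk appears in two of the three baskets that contain bread.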
Step 3: How does K-means work
The K-means algorithm works in iterative steps.
Finding the optimal k-means clustering is an NP-hard problem, which means there is no known efficient way to solve it in the general case. For this problem there are heuristic algorithms that converge quickly to a local optimum. That means you can find a good solution fast. It might not be the best one, but it is often good enough.
How does the algorithm work?
- Step 1: Start with a set of k means. These can be chosen by taking k random points from the dataset (called the Forgy initialisation method).
- Step 2: Group each data point into the cluster of the nearest mean. Hence, each data point will be assigned to exactly one cluster.
- Step 3: Recalculate the means (also called centroids) as the centre of each cluster, which moves them towards a local optimum.
Steps 2 and 3 are repeated until the grouping in Step 2 does not change any more.
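The three steps above can be sketched directly in NumPy. This is only an illustrative sketch, not how scikit-learn implements it: the function name k_means, the seed and max_iter parameters, and the choice to keep the old mean when a cluster ends up empty are all assumptions made for this example.

```python
import numpy as np


def k_means(points, k, seed=0, max_iter=100):
    """Illustrative k-means sketch; returns (labels, means)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial means.
    means = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to the cluster of its nearest mean.
        distances = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once the grouping no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each mean to the centre of its cluster.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # assumption: keep the old mean if a cluster is empty
                means[j] = members.mean(axis=0)
    return labels, means


points = np.array([[1, 2], [2, 4], [0.3, 2.5], [9.2, 8.5],
                   [2.4, 0.3], [9, 11], [12, 10]])
labels, means = k_means(points, k=2)
print(labels)  # one cluster index per point
print(means)   # one row per cluster centre
```

Which cluster gets index 0 and which gets index 1 depends on the random starting points, so only the grouping itself is meaningful.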
Step 4: A simple Python example with the k-means algorithm
This example assumes you know how to install the needed libraries. If not, then see the following article.
First off, you need to import the needed libraries.
    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib import style
    from sklearn.cluster import KMeans
In the first basic example we are only going to plot some points on a graph.
    style.use('ggplot')

    x = [1, 2, 0.3, 9.2, 2.4, 9, 12]
    y = [2, 4, 2.5, 8.5, 0.3, 11, 10]

    plt.scatter(x, y)
    plt.show()
The first line sets the style of the graph. Then we have the coordinates in the arrays x and y, which is the format the scatter function expects.
An advantage of plotting the points first is that it helps you figure out how many clusters you want to use. Here it looks like there are two "groups" of points, which translates into using two clusters.
To continue, we want to use the k-means algorithm with two clusters.
    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib import style
    from sklearn.cluster import KMeans

    style.use('ggplot')

    x = [1, 2, 0.3, 9.2, 2.4, 9, 12]
    y = [2, 4, 2.5, 8.5, 0.3, 11, 10]

    # We need to transform the input coordinates into the format the k-means algorithm expects
    X = []
    for i in range(len(x)):
        X.append([x[i], y[i]])
    X = np.array(X)

    # The number of clusters
    kmeans = KMeans(n_clusters=2)
    kmeans.fit(X)
    labels = kmeans.labels_
    # The cluster centres (means) found by the algorithm
    centroids = kmeans.cluster_centers_

    # Then we want to have different colors for each type.
    colors = ['g.', 'r.']
    for i in range(len(X)):
        # And plot them one at a time
        plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)

    # Plot the centres (or means)
    plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=5, zorder=10)
    plt.show()
This produces the following plot.
Considerations when using the K-means algorithm
We could have used 3 clusters instead. That would have resulted in the following output.
This is not optimal for this dataset, but that can be hard to predict without a visual representation of the data.
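When the data cannot easily be plotted, one common heuristic (not part of the original example) is to fit the model for several values of k and compare how tightly the points sit around their centres. The sketch below assumes scikit-learn's inertia_ attribute, the sum of squared distances from each point to its nearest centre, as that measure. The n_init and random_state values are arbitrary choices to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

x = [1, 2, 0.3, 9.2, 2.4, 9, 12]
y = [2, 4, 2.5, 8.5, 0.3, 11, 10]
X = np.column_stack([x, y])  # one row per point: [x, y]

inertias = []
for k in range(1, 5):
    # Fit k-means with k clusters and record how compact the clusters are.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)
    print(k, round(model.inertia_, 1))
```

On this dataset the drop from one to two clusters is far larger than any later drop, which agrees with the two groups seen in the scatter plot.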
Uses of the K-means algorithm
Here are some interesting uses of the K-means algorithm:
- Personalised marketing to users
- Identifying fake news
- Spam filter in your inbox