## What will we cover in this tutorial?

In this tutorial we will show how the Kruskal’s algorithm works. First, we will just give an example of what it does.

Consider the following a undirected edge-weighted graph. The vertices (or nodes) are numbered from 0 to 7 to make it simpler later. It contains edges, which are weighted. That means that the weight between vertex (or node) 0 and 1 is 6. Similar, the weight between vertex (or node) 0 and 2 is 3.

Our goal is to find a minimum spanning tree or forest if graph is not connected. That means we need to find a subset of edges, which connects all vertices (or nodes) with the minimum sum of weight.

First wenNotice, that all the vertices (nodes) are connected and we have chosen the edges the the lowest weight between them.

That is the challenge we want to solve. Given any undirected edge-weighted graph, find the minimum spanning tree of that graph.

In this tutorial we will understand and find a solution using Kruskal’s algorithm, which solves the above problem for an undirected edge-weighted graph.

## Step 1: High level understanding of Kruskal’s algorithm

We can see on wikipedia the following pseudo code is given.

```
algorithm Kruskal(G) is
F:= ∅
for each v ∈ G.V do
MAKE-SET(v)
for each (u, v) in G.E ordered by weight(u, v), increasing do
if FIND-SET(u) ≠ FIND-SET(v) then
F:= F ∪ {(u, v)}
UNION(FIND-SET(u), FIND-SET(v))
return F
```

Which is actually quite nice. So let us first try to understand what happens in the code. Below we describe the pseudo code in bullets from 1 to 5.

- The algorithm
**Kruskal**takes an argument**G**, which represents the graph. - It creates a set
**F**to be the empty set. The set**F**represents the forest, and we will add edges to the forest,**F**, as the algorithm iterates over the edges later. - Then it creates a set for each vertex (or node) in the graph. These sets needs to be represented somehow, which is not given from the algorithm. Also, each set only contains one vertex. As the algorithm iterates along, it will join (or make unions) the sets, as the vertices are joined.
- Then we iterate over the edges, which are ordered in increasing order of the weight. That is, we take the edge with the lowest weight first (if more have the same weight, we just take any of them).
- It will check if the vertices (
**u**and**v**), which the edge connects, are in the same set. - If they are
in the same set, then it will add the edge to the forest,*not***F**, and join the sets to one.

- It will check if the vertices (
- After iterating over all edges, we return the forest
**F**.

## Step 2: A visual walk-through of Kruskal’s algorithm.

Still a lot to figure out. Let’s try with an example. Below illustration depicts the graph, G, given as argument to **Kruskal’s** algorithm in **bullet 1** (in previous Step of the tutorial).

At **bullet 2**, we create an empty set, **F**, to represent the solution.

Then at **bullet 3** we make empty sets of all vertices. We can represent that with a graph with no edges.

At **bullet 4** we start iterating over the edges. There we will take the edge with the lowest weight from **G**. As there are two with weight 1, we can take any. Below we represent the sets as each group of vertices which are connected (only vertex 5 and 7 are connected), the rest are sets of a single vertex. The solution, **F**, is represented by the edges in the diagram.

Then we take the next edge, again it is an edge with weight 1. We check if they are in the same set. Otherwise we do the same. As they are not, we do the same.

And we continue like that.

Again.

Again.

Again.

Again.

Now it will continue to iterate over the edges, but none of them will be added, as we only have one set left, there is no more to be added.

Hence, it will return * F = [(5-7), (2-4), (3-7), (1-2), (5-6), (2-4), (0-2)]* in

**bullet 5**after the last iteration in the loop.

## Step 3: Representing the graph

We need two classes to implement Kruskal’s algorithm. One for representing the graph and one for representing the sets.

Let’s start with the graph. What is relevant in a graph. Let’s take a look at it again.

Here are the features of the class.

- It contains numbered vertices from 0 to 7. That is we need to know how many vertices there are.
- It contains edges between the vertices. They can be represented by two vertex indices. Also, it should have a weight.
- In the pseudo code, we also see that we need to access how many vertices there are. Get the edges and have the sorted by the weight in increasing order.

Now how do you do that?

```
class Graph:
def __init__(self, vertices):
self.vertices = vertices
self.edges = []
def add_edge(self, u, v, weight):
self.edges.append((u, v, weight))
def sort_edges(self):
self.edges = sorted(self.edges, key=lambda item: item[2])
def get_edges(self):
return self.edges
def number_of_vertices(self):
return self.vertices
```

In this way, we can create a graph with a number of vertices and add edges. When we have done that, we can sort them and get the. Also, we can get the number of vertices.

W can represent the above graph by the following code.

```
g = Graph(8)
g.add_edge(0, 1, 6)
g.add_edge(0, 2, 3)
g.add_edge(1, 2, 2)
g.add_edge(1, 3, 4)
g.add_edge(2, 3, 8)
g.add_edge(2, 4, 1)
g.add_edge(3, 4, 4)
g.add_edge(3, 6, 9)
g.add_edge(3, 7, 2)
g.add_edge(4, 5, 5)
g.add_edge(4, 7, 3)
g.add_edge(5, 6, 3)
g.add_edge(5, 7, 1)
g.add_edge(6, 7, 4)
```

That should represent the graph we have used in this tutorial.

## Step 4: Representing the sets in the algorithm

A simple temptation would be to represent the sets in the algorithm as a Python list of sets. But that would not give the time complexity of the algorithm.

Hence, we need to read what is needed. On wikipedia it says, the code is implemented with a disjoint-set data structure. The disjoint-set data structure has very efficient lookups of which set a vertex belongs to, as well as fast unions. That is needed to keep the complexity of Kruskal’s algorithm to be **O(E log(V))**, where **E** is the number of edges and **V** the number of vertices.

The disjoint-data set structure has faster than **O(log(V))** lookup and union operations. Hence, it will keep the time complexity to **O(E log(V))**. More about that later.

First for a implementation of the disjoint-data set structure in Python.

```
class DisjointSet:
def __init__(self, size):
self.parent = []
self.rank = []
for node in range(size):
self.parent.append(node)
self.rank.append(0)
def find(self, v):
if self.parent[v] == v:
return v
else:
return self.find(self.parent[v])
def union(self, u, v):
u_root = self.find(u)
v_root = self.find(v)
if self.rank[u_root] < self.rank[v_root]:
self.parent[u_root] = v_root
elif self.rank[u_root] > self.rank[v_root]:
self.parent[v_root] = u_root
else:
self.parent[v_root] = u_root
self.rank[u_root] += 1
```

We will not go into details of the time complexities of this data structure.

Notice the following.

- The constructor creates the number of disjoint sets, as given as argument in the constructor.
**Find**returns the**root**in the set. Hence, there is one uniquely defined**vertex**in the set that is the**root**. Hence, for any two vertices in the same set, the**find**function will always return the same**root**for each of the two vertices.- The
**union**function joins two sets based on a rank, in order to keep the time complexities.

## Step 5: Kruskal’s algorithm

Now we have the understanding and the two helper classes for the Kruskal’s algorithm. Then the implementation is almost identical to the pseudo code from wikipedia. The pseudo code is given here for comparison.

```
algorithm Kruskal(G) is
F:= ∅
for each v ∈ G.V do
MAKE-SET(v)
for each (u, v) in G.E ordered by weight(u, v), increasing do
if FIND-SET(u) ≠ FIND-SET(v) then
F:= F ∪ {(u, v)}
UNION(FIND-SET(u), FIND-SET(v))
return F
```

And the actual Python code is given here.

```
def kruskal(graph):
forest = []
graph.sort_edges()
disjoint_set = DisjointSet(graph.number_of_vertices())
for (u, v, weight) in graph.get_edges():
if disjoint_set.find(u) != disjoint_set.find(v):
forest.append((u, v, weight))
disjoint_set.union(u, v)
return forest
```

Now let’s try our example given in this tutorial.

Below you have the full code.

```
class Graph:
def __init__(self, vertices):
self.vertices = vertices
self.edges = []
def add_edge(self, u, v, weight):
self.edges.append((u, v, weight))
def sort_edges(self):
self.edges = sorted(self.edges, key=lambda item: item[2])
def get_edges(self):
return self.edges
def number_of_vertices(self):
return self.vertices
class DisjointSet:
def __init__(self, size):
self.parent = []
self.rank = []
for node in range(size):
self.parent.append(node)
self.rank.append(0)
def find(self, v):
if self.parent[v] == v:
return v
else:
return self.find(self.parent[v])
def union(self, u, v):
u_root = self.find(u)
v_root = self.find(v)
if self.rank[u_root] < self.rank[v_root]:
self.parent[u_root] = v_root
elif self.rank[u_root] > self.rank[v_root]:
self.parent[v_root] = u_root
else:
self.parent[v_root] = u_root
self.rank[u_root] += 1
def kruskal(graph):
forest = []
graph.sort_edges()
disjoint_set = DisjointSet(graph.number_of_vertices())
for (u, v, weight) in graph.get_edges():
if disjoint_set.find(u) != disjoint_set.find(v):
forest.append((u, v, weight))
disjoint_set.union(u, v)
return forest
g = Graph(8)
g.add_edge(0, 1, 6)
g.add_edge(0, 2, 3)
g.add_edge(1, 2, 2)
g.add_edge(1, 3, 4)
g.add_edge(2, 3, 8)
g.add_edge(2, 4, 1)
g.add_edge(3, 4, 4)
g.add_edge(3, 6, 9)
g.add_edge(3, 7, 2)
g.add_edge(4, 5, 5)
g.add_edge(4, 7, 3)
g.add_edge(5, 6, 3)
g.add_edge(5, 7, 1)
g.add_edge(6, 7, 4)
result = kruskal(g)
for r in result:
print(r)
```

This code will output.

```
(2, 4, 1)
(5, 7, 1)
(1, 2, 2)
(3, 7, 2)
(0, 2, 3)
(4, 7, 3)
(5, 6, 3)
```

Which is the same edges as we identified manually in Step 2 of this tutorial.

## Bonus: Complexity of the algorithm

Let’s have the algorithm here again.

```
def kruskal(graph):
forest = []
graph.sort_edges()
disjoint_set = DisjointSet(graph.number_of_vertices())
for (u, v, weight) in graph.get_edges():
if disjoint_set.find(u) != disjoint_set.find(v):
forest.append((u, v, weight))
disjoint_set.union(u, v)
return forest
```

We have that **E** is the number of edges and **V** is the number for vertices in the graph.

Then the sort on the graph uses **O(E log(E))**, but as we have that **E** is upper bonded by **V^2**, it follows that** O(log(E)) = O(log(V^2)) = O(2 log(V)) = O(log(V))**.

The for-loop over the edges runs **E** times. In each loop we call **find** and possible **union** on the disjoint-set data structure. These calls where upper bounded by **O(log(V))**. Hence, the main loop runs in **O(E log(V))**.

All in all **O(E log(V) + E log(V)) = O(E log(V))**.

## Learn Python

**Learn Python A BEGINNERS GUIDE TO PYTHON**

- 70 pages to get you started on your journey to
**master Python**. - How to install your setup with
**Anaconda**. - Written
**description**and introduction to all concepts. **Jupyter Notebooks**prepared for 17 projects.

**Python 101: A CRASH COURSE**

- How to
**get started**with this 8 hours.**Python 101: A CRASH COURSE** **Best practices**for learning Python.- How to download the
**material**to follow along and create projects. - A chapter for each lesson with a
**description**,**code****snippets**for easy reference, and links to a**lesson video**.

## Expert Data Science Blueprint

**Expert Data Science Blueprint**

- Master the Data Science Workflow for actionable data insights.
- How to download the material to follow along and create projects.
- A chapter to each lesson with a Description, Learning Objective, and link to the lesson video.

## Machine Learning

**Machine Learning – The Simple Path to Mastery**

- How to get started with Machine Learning.
- How to download the material to follow along and make the projects.
- One chapter for each lesson with a Description, Learning Objectives, and link to the lesson video.