## What will we cover in this tutorial?

We will look at how the Birthday Paradox is used to estimate how collision resistant a hash function is. This tutorial will show that a good estimate is that an **n**-bit hash function will have a collision by chance after about **2^(n/2)** random hash values, that is, it only offers about **n/2** bits of collision resistance.

## Step 1: Understand a hash function

A hash function is a one-way function with a fixed output size. That is, the output always has the same size, and it is difficult to find two distinct input chunks that give the same output.

> A [hash function](https://en.wikipedia.org/wiki/Hash_function) is any function that can be used to map data of arbitrary size to fixed-size values.
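The fixed output size can be checked directly. The following is a minimal sketch using `hashlib.md5` from the standard library; the example inputs are arbitrary and only chosen for illustration.

```python
import hashlib

# Inputs of very different sizes (arbitrary examples, not from the tutorial).
inputs = [b"", b"a", b"hello world", b"x" * 1_000_000]

for data in inputs:
    digest = hashlib.md5(data).digest()
    # MD5 always produces 16 bytes (128 bits), whatever the input size.
    print(len(data), len(digest), digest.hex())

# Collect the distinct digest lengths seen across all inputs.
digest_sizes = {len(hashlib.md5(data).digest()) for data in inputs}
print(digest_sizes)
```

Whatever the input length, the digest is 16 bytes, which is why MD5 is called a 128-bit hash function.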

Probably the best known example of a hash function is MD5. It was designed to be used as a cryptographic hash function, but has been found to have many vulnerabilities.

**Does this mean you should not use the MD5 hash function?**

That depends. If you use it in a cryptographic setup, the answer is **Do not use**.

On the other hand, hash functions are often used to calculate identifiers. Whether you should use MD5 for that purpose also depends.

This is where the Birthday Paradox comes in.

## Step 2: How are hash functions and the Birthday Paradox related?

Good question. First recall what the Birthday Paradox states.

https://www.learnpythonwithrune.org/birthday-paradox-by-example-it-is-not-a-paradox/

> …in a random group of 23 people, there is about a 50 percent chance that two people have the same birthday

How can that be related to hash functions? There is something about collisions, right?

Given 23 people, we have 50% chance of collision (two people with the same birthday).

Hence, suppose our hash function maps data to a day in the calendar year, that is, **hash(data) -> [0, 364]**. Then, given 23 hash values, we have a 50% chance of a collision.
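The 50% figure can be verified with a short calculation. This is a sketch under the assumption that every hash value (or birthday) is uniform and independent; the function name `birthday_collision_probability` is made up for this example.

```python
def birthday_collision_probability(num_values, space_size):
    """Exact probability that at least two of num_values uniform draws collide."""
    if num_values > space_size:
        return 1.0  # pigeonhole principle: a collision is certain
    p_no_collision = 1.0
    for i in range(num_values):
        # The (i+1)-th value must avoid the i values already taken.
        p_no_collision *= (space_size - i) / space_size
    return 1.0 - p_no_collision


p_23 = birthday_collision_probability(23, 365)
print(f"{p_23:.3f}")  # about 0.507
```

The same function works for any space size, so it can also be applied to the reduced hash functions used later.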

But you also know that our hash function maps to more than 365 distinct values. Actually, the **MD5** maps to **2^128** distinct values.

An example would be appreciated now. Let us make a simplified hash function, call it **MD5′** (md5-prime), which maps like the MD5, but only uses the first byte of the result.

That is, we have **MD5′(data) -> [0, 255]**.

Surely, by the pigeonhole principle, we would run out of possible values after 256 distinct data inputs to **MD5′** and have a collision on the next one.

```python
import hashlib
import os

lookup_table = {}
collision_count = 0
for _ in range(256):
    random_binary = os.urandom(16)
    result = hashlib.md5(random_binary).digest()
    # MD5': only keep the first byte of the digest
    result = result[:1]
    if result in lookup_table:
        print("Collision")
        print(random_binary, result)
        print(lookup_table[result], result)
        collision_count += 1
    else:
        lookup_table[result] = random_binary
print("Number of collisions:", collision_count)
```

The **lookup_table** is used to store the already seen hash values. We iterate 256 times (the number of possible values of our **MD5′** hash function). Each time we take some random data, hash it with **md5**, and only use the first byte (8 bits). If the result already exists in **lookup_table** we have a collision; otherwise we add it to our **lookup_table**.

For a random run of this I got 87 collisions. Expected? I would say so.
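To see why a number like this is expected, here is a rough sketch of the expected collision count. It assumes every hash value is uniform and independent, and it counts a collision the same way the loop above does (a value that was already seen). The function name `expected_collisions` is made up for this example.

```python
def expected_collisions(num_values, space_size):
    """Expected number of draws that hit an already seen value."""
    # Probability that one particular value is never drawn.
    p_value_never_seen = (1 - 1 / space_size) ** num_values
    # Expected number of distinct values seen after all draws.
    expected_distinct = space_size * (1 - p_value_never_seen)
    # Every draw that did not add a new distinct value was a collision.
    return num_values - expected_distinct


estimate = expected_collisions(256, 256)
print(round(estimate, 1))
```

For 256 draws from 256 possible values this gives roughly 94, so a single run with 87 collisions is well within what chance would produce.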

Let us try to use the Birthday Paradox to estimate how many hash values we need to get a collision of our **MD5′** hash function.

A widely used rough estimate is that hashing about the square root of the number of possible outcomes gives close to a 50% chance of a collision (see the Wikipedia article on the birthday problem for the approximation).

That is, for **MD5′(data) -> [0, 255]**, it is **sqrt(256) = 16**. Let’s try that.

```python
import hashlib
import os

collision = 0
for _ in range(1000):
    lookup_table = {}
    # One trial: hash 16 random inputs and stop at the first collision
    for _ in range(16):
        random_binary = os.urandom(16)
        result = hashlib.md5(random_binary).digest()
        result = result[:1]
        if result not in lookup_table:
            lookup_table[result] = random_binary
        else:
            collision += 1
            break
print("Number of collisions:", collision, "out of", 1000)
```

Which gives something like this.

Number of collisions: 391 out of 1000

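That is on the low side of 50%, but it is what the birthday approximation **1 - e^(-k^2/(2N))** predicts: for **k = sqrt(N)** values it gives about 39%. The square root is therefore still a reasonable rough estimate.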
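The observed rate can be compared with the standard birthday approximation. The sketch below assumes uniform, independent hash values; the function names are made up for this example, and the formulas are the usual approximations, not exact values.

```python
import math


def approx_collision_probability(num_values, space_size):
    """Approximate chance of at least one collision: 1 - exp(-k(k-1)/(2N))."""
    exponent = -num_values * (num_values - 1) / (2 * space_size)
    return 1 - math.exp(exponent)


def values_for_half_chance(space_size):
    """Approximate number of values needed for a 50% chance: sqrt(2 ln(2) N)."""
    return math.sqrt(2 * math.log(2) * space_size)


p_16 = approx_collision_probability(16, 256)
k_half = values_for_half_chance(256)
print(f"16 values out of 256: {p_16:.2f}")
print(f"values needed for 50%: {k_half:.1f}")
```

For a space of 256 values, 16 hash values give a chance in the high thirties of percent, and a true 50% chance needs closer to 19 values, about 1.18 times the square root.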

## Step 3: Use a correct data structure to lookup in

Just to clarify. We will not find collisions on the full MD5 hash function, but we will try to see if the estimate of collision is reasonable.

This requires a lot of calculations, and we want to ensure that we do not create a bottleneck by using the wrong data structure.

The Python dict should be a hash table with expected O(1) insert and lookup. Still, the worst case is O(n) for these operations, which would be a big overhead to carry along the way. Hence, we will first test that the dictionary has O(1) insert and lookup time for the use cases we have of it here.

```python
import time
import matplotlib.pyplot as plt


def dict_size(size):
    start = time.time()
    table = {}  # named table to avoid shadowing the built-in dict
    for i in range(size):
        if i in table:
            print("HIT")
        else:
            table[i] = 0
    return time.time() - start


x = []
y = []
for i in range(0, 2**20, 2**12):
    performance = dict_size(i)
    x.append(i)
    y.append(performance)

plt.scatter(x, y, alpha=0.1)
plt.xlabel("Size")
plt.ylabel("Time (sec)")
plt.show()
```

Resulting in something like this.

What does that tell us? That filling a Python dict takes approximately linear time in the number of elements, that is, each insert and lookup is O(1) on average. But there is some overhead at some sizes, e.g. a bit before 3,000,000. It is not exactly linear, but close enough not to expect an exponential run time.

This step is not necessary, but it is nice to know how the run time grows when we want to check for collisions. If the above run time grew exponentially (or otherwise faster than linearly), it could suddenly become hard to estimate the run time when we search a bigger space.

## Step 4: Validating if the square root of the space size is a good estimate for collision

We will continue our journey with our modified **MD5′** hash function, where the output space will be reduced.

We will then for various output space sizes see if the estimate for 50% collision of the hash functions is decent. That is, if we need approximately **sqrt(space_size)** of hash values to have an approximately 50% chance of a collision.

This can be done by the following code.

```python
import hashlib
import os
import time
import matplotlib.pyplot as plt


def main(bit_range):
    start = time.time()
    collision_count = 0
    # Each hex character counts for 4 bits, hence we keep
    # bit_range//4 hex characters of the digest
    space_size = bit_range//4
    for _ in range(100):
        lookup_table = {}
        # Searching the sqrt of the space for a collision
        # sqrt(2**bit_range) = 2**(bit_range//2)
        for _ in range(2**(bit_range//2)):
            random_binary = os.urandom(16)
            result = hashlib.md5(random_binary).hexdigest()
            result = result[:space_size]
            if result in lookup_table:
                collision_count += 1
                break
            else:
                lookup_table[result] = random_binary
    return time.time() - start, collision_count


x = []
y1 = []
y2 = []
for i in range(4, 44, 4):
    performance, count = main(i)
    x.append(i)
    y1.append(performance)
    y2.append(count)

_, ax1 = plt.subplots()
plt.xlabel("Size")
plt.ylabel("Time (sec)")
ax1.scatter(x, y1)
ax2 = ax1.twinx()
# 100 trials per size, so the collision count is also the rate in percent
ax2.bar(x, y2, align='center', alpha=0.5, color='red')
ax2.set_ylabel("Collision rate (%)", color='red')
ax2.set_ylim([0, 100])
plt.show()
```

The estimated collision rate is very rough, as it only runs 100 trials for each space size.

The results are shown in the graph below.

Interestingly, it seems to be in the 30-50% range for most cases.

As a note, it might be confusing that the run time (the dots) does not seem to be linear. That is because for each bit we add to the size, we double the space. Hence, the x-axis is a logarithmic scale.
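The growth behind the dots can be made explicit by listing how many hash values each trial may compute per bit size. This is a small sketch that reuses the `range(4, 44, 4)` bit sizes and the `2**(bit_range//2)` bound from the code above; the variable names are made up for this example.

```python
# Bit sizes used in the experiment above.
bit_sizes = list(range(4, 44, 4))

# Upper bound on the hash values computed in one trial: sqrt of the space.
values_per_trial = [2 ** (bits // 2) for bits in bit_sizes]

for bits, count in zip(bit_sizes, values_per_trial):
    print(f"{bits:2d} bits -> up to {count:>9,} hash values per trial")
```

Each step of 4 bits multiplies the space by 16 and the work per trial by 4, so evenly spaced points on the x-axis correspond to exponentially growing run times.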

## Step 5: What does that all mean?

This has a high impact on using hash functions for creating unique identifiers. If you want a short identifier with the least number of bits, then you need to consider the Birthday Paradox.

Assume you created the following service.

```python
import hashlib
import base64


def get_uid(text):
    result = hashlib.md5(text.encode()).digest()
    result = base64.b64encode(result)
    # Only keep the first 4 base64 characters as the identifier
    return result[:4]


uid = get_uid("my text")
print(uid)
```

If the input text can be considered random, how resistant is the **get_uid(…)** function against collisions?

Well, it returns 4 base64 characters. That is **6*4 = 24** bits of information (each base64 character contains 6 bits of information). The rough estimate is that if you use it **sqrt(2^24) = 2^12 = 4,096** times you will have a high risk of collision (approximately 50% chance).
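The arithmetic can be written out as a small helper. This is only a sketch of the rough square-root rule; the names `uid_bits` and `rough_collision_threshold` are made up for this example.

```python
import math

BITS_PER_BASE64_CHAR = 6  # base64 has 64 = 2**6 possible characters


def uid_bits(num_chars):
    """Bits of information in an identifier of num_chars base64 characters."""
    return num_chars * BITS_PER_BASE64_CHAR


def rough_collision_threshold(bits):
    """Square-root rule: number of identifiers after which collisions get likely."""
    return math.isqrt(2 ** bits)


bits = uid_bits(4)
threshold = rough_collision_threshold(bits)
print(bits, threshold)
```

With the same helpers, 8 base64 characters would give 48 bits and a rough threshold of 2^24 identifiers.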

Let’s try.

```python
import hashlib
import os
import base64


def get_uid(text):
    result = hashlib.md5(text).digest()
    result = base64.b64encode(result)
    return result[:4]


lookup_table = {}
for _ in range(4096):
    text = os.urandom(16)
    uid = get_uid(text)
    if uid in lookup_table:
        print("Collision detected")
    else:
        lookup_table[uid] = text
```

It does not give a collision every time, but run it a few times and you will get the following.

Collision detected

Hence, it seems to be valid. The above code was run 1000 times and gave collision 497 times, which is close to 50% of the time.
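Turning the estimate around gives a way to pick an identifier size. The sketch below is an assumption-laden helper, not part of the experiment above: it inverts the birthday approximation p ~ 1 - exp(-n^2 / (2N)) to find how many bits are needed so that `num_items` random identifiers collide with probability at most `max_collision_probability`. The function name and parameters are made up for this example.

```python
import math


def min_identifier_bits(num_items, max_collision_probability):
    """Smallest bit size keeping the collision chance below the given bound.

    Uses the approximation p ~ 1 - exp(-n^2 / (2N)), solved for the
    space size N, and assumes uniform, independent identifiers.
    """
    # -log1p(-p) equals ln(1 / (1 - p)) and stays accurate for tiny p.
    space_size = num_items ** 2 / (2 * -math.log1p(-max_collision_probability))
    return math.ceil(math.log2(space_size))


print(min_identifier_bits(4096, 0.5))
print(min_identifier_bits(1_000_000, 1e-6))
```

For 4,096 identifiers and a 50% bound this gives 24 bits, matching the 4 base64 characters above, while a million identifiers with a one in a million collision chance needs about 59 bits.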