Birthday Paradox and Hash Function Collisions by Example

What will we cover in this tutorial?

We will look at how the Birthday Paradox is used when estimating how collision resistant a hash function is. This tutorial will show that a good estimate is that an n-bit hash function will produce a collision by chance after about 2^(n/2) random hash values.

Step 1: Understand a hash function

A hash function is a one-way function with a fixed output size. That is, the output always has the same size, and it is difficult to find two distinct input chunks that give the same output.

A hash function is any function that can be used to map data of arbitrary size to fixed-size values.

Probably the best-known example of a hash function is MD5. It was designed to be used as a cryptographic hash function, but has been found to have many vulnerabilities.
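To illustrate the fixed output size, a quick check in Python shows that MD5 returns a digest of the same length no matter how big the input is:

```python
import hashlib

# MD5 always produces a 16-byte (128-bit) digest, no matter the input size.
for data in [b"", b"hello", b"x" * 10_000]:
    digest = hashlib.md5(data).digest()
    print(len(digest))  # prints 16 each time
```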

Does this mean you should not use the MD5 hash function?

That depends. If you use it in a cryptographic setting, the answer is: do not use it.

On the other hand, hash functions are often used to calculate identifiers. For that purpose, whether you should use it depends on your requirements.

This is where the Birthday Paradox comes in.

Step 2: How are hash functions and the Birthday Paradox related?

Good question. First recall what the Birthday Paradox states.

…in a random group of 23 people, there is about a 50 percent chance that two people have the same birthday

How can that be related to hash functions? There is something about collisions, right?

Given 23 people, we have 50% chance of collision (two people with the same birthday).

Hence, suppose our hash function maps data to a day in the calendar year. That is, it maps hash(data) -> [0, 364]. Then, given 23 hash values, we have a 50% chance of a collision.
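As a toy illustration (my own sketch, not part of the original example), such a calendar-sized hash could be built by reducing MD5 modulo 365:

```python
import hashlib
import os

def calendar_hash(data):
    # Reduce the 128-bit MD5 digest to a "day of the year" in [0, 364].
    digest = hashlib.md5(data).digest()
    return int.from_bytes(digest, "big") % 365

# Hash 23 random inputs and check whether any two collide;
# this should print True roughly half of the time.
values = [calendar_hash(os.urandom(16)) for _ in range(23)]
print(len(values) != len(set(values)))
```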

But you also know that real hash functions map to more than 365 distinct values. MD5, for instance, maps to 2^128 distinct values.

An example would be appreciated now. Let us make a simplified hash function, call it MD5′ (md5-prime), which maps like MD5 but only uses the first byte of the result.

That is, we have MD5′(data) -> [0, 255].

Surely, by the pigeonhole principle we would run out of possible values after 256 distinct inputs to MD5′, and the 257th distinct input would be guaranteed to collide.

import hashlib
import os

lookup_table = {}
collision_count = 0
for _ in range(256):
    random_binary = os.urandom(16)
    result = hashlib.md5(random_binary).digest()
    # MD5': keep only the first byte of the digest
    result = result[:1]
    if result in lookup_table:
        print(random_binary, result)
        print(lookup_table[result], result)
        collision_count += 1
    # Remember the latest input that produced this hash value
    lookup_table[result] = random_binary

print("Number of collisions:", collision_count)

The lookup_table is used to store the already seen hash values. We iterate 256 times (the number of possible values of our MD5′ hash function), take some random data, hash it with MD5, and keep only the first byte (8 bits). If the result already exists in lookup_table, we have a collision; either way, the input is stored under its hash value.

For a random run of this I got 87 collisions. Expected? I would say so.
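As a rough sanity check (a small calculation of my own), when drawing n values uniformly from m possible values, the expected number of distinct values hit is m·(1 − (1 − 1/m)^n), so the expected number of collisions is n minus that:

```python
def expected_collisions(n, m):
    # Expected collisions = number of draws minus the expected
    # number of distinct values hit by those draws.
    return n - m * (1 - (1 - 1 / m) ** n)

print(round(expected_collisions(256, 256)))  # 94, so 87 is in the right ballpark
```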

Let us try to use the Birthday Paradox to estimate how many hash values we need to get a collision of our MD5′ hash function.

A widely used rough estimate is that the square root of the number of possible outcomes gives a 50% chance of collision (see Wikipedia for the approximation).

That is, for MD5′(data) -> [0, 255] it is sqrt(256) = 16. Let’s try that.

import hashlib
import os

collision = 0
for _ in range(1000):
    lookup_table = {}
    for _ in range(16):
        random_binary = os.urandom(16)
        result = hashlib.md5(random_binary).digest()
        result = result[:1]
        if result in lookup_table:
            # Count this trial as having a collision and stop it
            collision += 1
            break
        lookup_table[result] = random_binary

print("Number of collisions:", collision, "out of", 1000)

Which gives something like this.

Number of collisions: 391 out of 1000

That is at the lower end, but still a reasonable approximation.
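To see why a figure somewhat below 50% is expected, the exact collision probability can be computed with the same product formula used for the Birthday Paradox (a small check of my own):

```python
def collision_probability(n, m):
    # Probability of at least one collision among n uniform draws
    # from m possible values (exact birthday-problem product).
    p_no_collision = 1.0
    for k in range(n):
        p_no_collision *= (m - k) / m
    return 1 - p_no_collision

print(round(collision_probability(16, 256), 2))  # 0.38, matching 391 out of 1000
```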

Step 3: Use a correct data structure to lookup in

Just to clarify: we will not find collisions for the full MD5 hash function, but we will check whether the collision estimate is reasonable.

This requires a lot of calculations, and we want to ensure that we do not create a bottleneck by using the wrong data structure.

The Python dict is a hash table with expected O(1) insert and lookup. Still, the worst case is O(n) for these operations, which would be a big overhead to carry along the way. Hence, we will first test that the dictionary has O(1) insert and lookup time for the use we make of it here.

import time
import matplotlib.pyplot as plt

def dict_size(size):
    start = time.time()
    table = {}
    for i in range(size):
        table[i] = 0       # insert
        _ = i in table     # lookup
    return time.time() - start

x = []
y = []
for i in range(0, 2**20, 2**12):
    performance = dict_size(i)
    x.append(i)
    y.append(performance)

plt.scatter(x, y, alpha=0.1)
plt.xlabel("Number of insert/lookup operations")
plt.ylabel("Time (sec)")
plt.show()

Resulting in something like this.

What does that tell us? That the total time for the dict in Python grows approximately linearly with the number of operations, which means each insert and lookup is O(1) on average. There is some overhead at certain sizes, presumably when the underlying hash table resizes. It is not exactly linear, but close enough that we should not expect an exponential run time.

This step is not strictly necessary, but it is nice to know how the run time grows when we check for collisions. If the time complexity grew exponentially (or otherwise non-linearly), it could suddenly become hard to estimate the run time when we search a bigger space.

Step 4: Validating whether the square root of the space size is a good estimate for collisions

We will continue our journey with our modified MD5′ hash function, where the output space will be reduced.

We will then, for various output space sizes, see if the estimate for a 50% collision chance is decent. That is, whether approximately sqrt(space_size) hash values give an approximately 50% chance of a collision.

This can be done by the following code.

import hashlib
import os
import time
import matplotlib.pyplot as plt

def main(bit_range):
    start = time.time()
    collision_count = 0
    # Each hex character of the digest holds 4 bits, hence
    space_size = bit_range//4
    for _ in range(100):
        lookup_table = {}
        # Searching the sqrt of the space for a collision:
        # sqrt(2**bit_range) = 2**(bit_range//2)
        for _ in range(2**(bit_range//2)):
            random_binary = os.urandom(16)
            result = hashlib.md5(random_binary).hexdigest()
            result = result[:space_size]
            if result in lookup_table:
                # Count this trial as having a collision and stop it
                collision_count += 1
                break
            lookup_table[result] = random_binary

    return time.time() - start, collision_count

x = []
y1 = []
y2 = []
for i in range(4, 44, 4):
    performance, count = main(i)
    x.append(i)
    y1.append(performance)
    # count out of 100 trials is the collision rate in percent
    y2.append(count)

_, ax1 = plt.subplots()
ax1.set_xlabel("Hash output size (bits)")
ax1.set_ylabel("Time (sec)")
ax1.scatter(x, y1)
ax2 = ax1.twinx()
ax2.bar(x, y2, align='center', alpha=0.5, color='red')
ax2.set_ylabel("Collision rate (%)", color='red')
ax2.set_ylim([0, 100])
plt.show()

The estimated collision rate is very rough, as it only runs 100 trials for each space size.

The results are shown in the graph below.

Interestingly, it seems to be in the 30-50% range for most cases.

As a note, it might be confusing that the run time (the dots) does not seem to be linear. That is because each extra bit doubles the space, so the x-axis is effectively a logarithmic scale.

Step 5: What does that all mean?

This has a high impact on using hash functions for creating unique identifiers. If you want a short identifier with the fewest bits possible, you need to take the Birthday Paradox into account.

Assume you created the following service.

import hashlib
import base64

def get_uid(text):
    result = hashlib.md5(text.encode()).digest()
    result = base64.b64encode(result)
    return result[:4]

uid = get_uid("my text")

If the input text can be considered random, how resistant is the get_uid(…) function against collisions?

Well, it returns 4 base64 characters. That is 6*4 = 24 bits of information (each base64 character carries 6 bits). The rough estimate is that if you use it sqrt(2^24) = 2^12 = 4,096 times, you have a high risk of a collision (approximately a 50% chance).
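A standard back-of-the-envelope approximation for this (my own addition, not part of the original service) is that the collision probability for n draws from m values is roughly 1 − e^(−n(n−1)/(2m)). Note that at exactly sqrt(m) draws this gives about 39%, and roughly 1.18·sqrt(m) draws are needed for a true 50% chance:

```python
import math

def approx_collision_probability(n, m):
    # Birthday-bound approximation for n uniform draws from m values.
    return 1 - math.exp(-n * (n - 1) / (2 * m))

print(round(approx_collision_probability(2**12, 2**24), 2))  # 0.39
```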

Let’s try.

import hashlib
import os
import base64

def get_uid(text):
    result = hashlib.md5(text).digest()
    result = base64.b64encode(result)
    return result[:4]

lookup_table = {}
for _ in range(4096):
    text = os.urandom(16)
    uid = get_uid(text)
    if uid in lookup_table:
        print("Collision detected")
    lookup_table[uid] = text

It does not give a collision every time, but run it a few times and you will get:

Collision detected

Hence, the estimate seems to be valid. The above code was run 1000 times and gave a collision 497 times, which is close to 50% of the time.

Birthday Paradox by Example – it is not a Paradox

The Birthday Paradox is presented as follows.

…in a random group of 23 people, there is about a 50 percent chance that two people have the same birthday


This is also referred to as the Birthday Problem in probability theory.

First question: What is a paradox?

…is a logically self-contradictory statement or a statement that runs contrary to one’s expectation


What does that mean? A logically self-contradictory statement means that there should be a contradiction somewhere in the Birthday Paradox. This is not the case.

Then a statement that runs contrary to one’s expectations could be open for discussion. As we will see by example in this post, it is not contrary to the expectation of an informed person.

Step 1: Run some examples

The assumption is that we have 23 random people. This further assumes that the birthday of each of these people is random.

To validate that this is true, let’s try to implement it in Python.

import random

stat = {'Collision': 0, 'No-collision': 0}

for _ in range(10000):
    days = []
    for _ in range(23):
        # A random day of the year in [0, 364]
        day = random.randint(0, 364)
        days.append(day)

    if len(days) == len(set(days)):
        stat['No-collision'] += 1
    else:
        stat['Collision'] += 1

print("Probability for at least 2 with same birthday in a group of 23")
print("P(A) =", stat['Collision']/(stat['Collision'] + stat['No-collision']))

This will output different results from run to run, but something around 0.507.

Probability for at least 2 with same birthday in a group of 23
P(A) = 0.5026

A few comments on the code. It keeps a record of how many times, when choosing 23 random birthdays, we end up with at least two of them being the same day. We run the experiment 10,000 times to rule out pure luck.

The check if len(days) == len(set(days)) tests whether no two birthdays are the same. The function set(…) keeps only the unique days from the list. Hence, if two days of the year are the same, the set will be shorter than the list.

Step 2: The probability theory behind it

This is where it becomes a bit more technical. The above shows that it behaves as stated: if we take a group of 23 random people, then with a probability of about 50%, two of them will have the same birthday.

Is this contrary to one’s expectations? Hence, is it a paradox?

Before we answer that, let’s see if we can nail the probability theory behind this.

Do it step by step.

If we have 1 person, what is the probability that two people in this group of 1 have the same birthday? Yes, it sounds strange. The probability is obviously 0.

If we have 2 persons, what is the probability that any of the 2 people have the same birthday? They would need to share the one birthday. Hence, the probability becomes 1/365.

How do you write that as an equation?

What we often do in probability theory, is, that we calculate the opposite probability.

Hence, we calculate the probability of not having two identical birthdays in a group. This is easier to calculate. In the first case, we have all possibilities open.

P(1) = 1

Where P(1) is the probability that, in a group of one person, no one shares a birthday with anyone else in the group.

P(2) = 1 x (364 / 365)

Because the first birthday can be any day, while the second only has 364 of the 365 possible birthdays left if it should not collide.

This continues.

P(n) = 1 x (364 / 365) x (363 / 365) x … x ((365 – n + 1) / 365)

Which makes the probability of picking 23 random people without anyone with the same birthday to be.

P(23) = 1 x (364 / 365) x (363 / 365) x … x (343 / 365) = 0.493

Or calculated in Python.

def prop(n):
    if n == 1:
        return 1
    return (365 - n + 1) / 365 * prop(n - 1)

print("Probability for no-one with same birthday in a group of 23")
print("P(A') =",  prop(23))

Which results in.

Probability for no-one with same birthday in a group of 23
P(A') = 0.4927027656760144

This formula can be rewritten (see Wikipedia), but the above is fine for our purpose.

The probability we look for is given by.

P(A) = 1 – P(A’)
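Putting the two together in Python (repeating the prop(…) helper so the snippet is self-contained):

```python
def prop(n):
    # P(A'): probability of no shared birthday in a group of n people.
    if n == 1:
        return 1
    return (365 - n + 1) / 365 * prop(n - 1)

# P(A) = 1 - P(A')
print(round(1 - prop(23), 4))  # 0.5073, matching the simulation above
```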

Step 3: Make a graph of how likely a collision is based on a group size

This is great news. We can now calculate the theoretical probability of two people having the same birthday in a group of n random people.

This can be achieved by the following code.

from matplotlib import pyplot as plt

def prop(n):
    if n == 1:
        return 1
    return (365 - n + 1) / 365 * prop(n - 1)

X = []
Y = []
for i in range(1, 90):
    X.append(i)
    Y.append(1 - prop(i))

plt.scatter(X, Y, color='blue')
plt.xlabel("Number of people")
plt.ylabel("Probability of collision")
plt.axis([0, 90, 0, 1])
plt.show()

Which results in the following plot.

Here you can see that at about 23 people, we have a 50% chance of having a pair with the same birthday (called a collision).


Is it a paradox? Well, there is no magic in it. You can see that the above are just simple calculations. But is the following contrary to one’s expectation?

6 weeks are 6*7*24*60*60 = 3,628,800 seconds.

And 10! = 10*9*8*7*6*5*4*3*2*1 = 3,628,800.
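You can verify the coincidence directly:

```python
import math

# 6 weeks expressed in seconds equals 10 factorial.
seconds_in_six_weeks = 6 * 7 * 24 * 60 * 60
print(seconds_in_six_weeks == math.factorial(10))  # True
```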

Well, the first time you calculate it might be. But does that make it a paradox?

No, it is just a surprising fact the first time you see it. Does it mean that seconds are related to factorials? No, of course not. It is just one strange thing that connects in a random way.

The same with the Birthday Paradox, it is just surprising the first time you see it.

It seems surprising that you only need 23 people to have a 50% chance of a pair with the same birthday, but it is not a paradox for people who work with numbers.