Birthday Paradox and Hash Function Collisions by Example

What will we cover in this tutorial?

We will look at how the Birthday Paradox is used to estimate how collision resistant a hash function is. This tutorial will show that a good estimate is that an n-bit hash function will produce a collision by chance after roughly 2^(n/2) random hash values.

Step 1: Understand a hash function

A hash function is a one-way function with a fixed output size. That is, the output always has the same size, and it is difficult to find two distinct input chunks that give the same output.

A hash function is any function that can be used to map data of arbitrary size to fixed-size values.

https://en.wikipedia.org/wiki/Hash_function
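To make the "fixed output size" part concrete, here is a minimal sketch using Python's hashlib with MD5 as the example function. The input values are arbitrary illustrations, not taken from any real data set.

```python
import hashlib

# Inputs of very different sizes (arbitrary example data).
inputs = [b"", b"a", b"hello world", b"x" * 1_000_000]

for data in inputs:
    digest = hashlib.md5(data).digest()
    # Print the input size, the digest size in bytes, and the digest in hex.
    print(len(data), len(digest), digest.hex())

# Collect the distinct digest sizes seen; MD5 always returns 16 bytes (128 bits).
digest_sizes = {len(hashlib.md5(data).digest()) for data in inputs}
print("Digest sizes seen:", digest_sizes)
```

Whatever the input size, the digest is 16 bytes, which is exactly why collisions must exist: there are more possible inputs than outputs.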

Probably the best-known example of a hash function is MD5. It was designed to be used as a cryptographic hash function, but has been found to have many vulnerabilities.

Does this mean you should not use the MD5 hash function?

That depends. If you use it in a cryptographic setup, the answer is: do not use it.

On the other hand, hash functions are often used to calculate identifiers. For that purpose, whether it is good enough depends on how likely collisions are.

This is where the Birthday Paradox comes in.

Step 2: How are hash functions and the Birthday Paradox related?

Good question. First recall what the Birthday Paradox states.

…in a random group of 23 people, there is about a 50 percent chance that two people have the same birthday

https://www.learnpythonwithrune.org/birthday-paradox-by-example-it-is-not-a-paradox/

How can that be related to hash functions? There is something about collisions, right?

Given 23 people, we have a 50% chance of a collision (two people with the same birthday).

Hence, suppose our hash function maps data to a day in the calendar year. That is, it maps hash(data) -> [0, 364]. Then, given 23 hash values, we have about a 50% chance of a collision.
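The 50% figure can be checked with a few lines of Python. This is a small sketch, assuming the values are uniformly random; the helper name collision_probability is made up for this illustration.

```python
def collision_probability(samples, space_size):
    """Exact probability that at least two of `samples` uniformly random
    values drawn from `space_size` possibilities are equal."""
    if samples > space_size:
        return 1.0  # pigeonhole principle: a collision is certain
    p_all_distinct = 1.0
    for i in range(samples):
        # The (i+1)-th value must avoid the i values already taken.
        p_all_distinct *= (space_size - i) / space_size
    return 1.0 - p_all_distinct


p_23 = collision_probability(23, 365)
p_22 = collision_probability(22, 365)
print("22 values:", p_22)
print("23 values:", p_23)
```

With 23 values the probability comes out just above one half (about 0.507), and with 22 values it is still below one half.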

But you also know that real hash functions map to far more than 365 distinct values. MD5, for example, maps to 2^128 distinct values.

An example would be appreciated now. Let us make a simplified hash function, call it MD5′ (md5-prime), which maps like the MD5, but only uses the first byte of the result.

That is, we have MD5′(data) -> [0, 255].

Surely, by the pigeonhole principle, we would run out of possible values after 256 distinct inputs to MD5′, so the 257th input is guaranteed to collide. In practice, collisions show up much earlier.

import hashlib
import os


lookup_table = {}
collision_count = 0
for _ in range(256):
    random_binary = os.urandom(16)
    result = hashlib.md5(random_binary).digest()
    result = result[:1]
    if result in lookup_table:
        print("Collision")
        print(random_binary, result)
        print(lookup_table[result], result)
        collision_count += 1
    else:
        lookup_table[result] = random_binary

print("Number of collisions:", collision_count)

The lookup_table is used to store the hash values we have already seen. We iterate 256 times (the number of possible values of our MD5′ hash function). Each time we take some random data, hash it with MD5, and keep only the first byte (8 bits). If the result already exists in lookup_table, we have a collision; otherwise we add it to lookup_table.

For a random run of this I got 87 collisions. Expected? I would say so: with 256 random values in a space of only 256, many values are hit more than once.
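As a rough sanity check of that number, the expected count can be computed directly. This is a sketch assuming the truncated hash values behave like uniformly random bytes; expected_collisions is a helper name invented here.

```python
def expected_collisions(samples, space_size):
    """Expected number of draws that hit an already seen value, when
    `samples` uniform values are drawn from `space_size` possibilities."""
    # Each possible value is missed by all draws with probability
    # (1 - 1/space_size) ** samples, so this is the expected number of
    # distinct values that do show up.
    expected_distinct = space_size * (1 - (1 - 1 / space_size) ** samples)
    # Every draw that does not introduce a new value is a collision.
    return samples - expected_distinct


expected = expected_collisions(256, 256)
print("Expected collisions:", round(expected))
```

This gives about 94 collisions on average, so a single run with 87 is well within what chance allows.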

Let us try to use the Birthday Paradox to estimate how many hash values we need to get a collision of our MD5′ hash function.

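A rough estimate that is widely used is that hashing about the square root of the number of possible outcomes gives a collision chance in the neighborhood of 50% (see Wikipedia for the approximation). More precisely, sqrt(N) values give about a 39% chance (1 - e^(-1/2)), and about 1.18 * sqrt(N) values are needed to reach 50%.

That is, for MD5′(data) -> [0, 255] it is sqrt(256) = 16. Let’s try that.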

import hashlib
import os


collision = 0
for _ in range(1000):
    lookup_table = {}
    for _ in range(16):
        random_binary = os.urandom(16)
        result = hashlib.md5(random_binary).digest()
        result = result[:1]
        if result not in lookup_table:
            lookup_table[result] = random_binary
        else:
            collision += 1
            break

print("Number of collisions:", collision, "out of", 1000)

Which gives something like this.

Number of collisions: 391 out of 1000

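That is below 50%, which matches the roughly 38% that the exact calculation gives for 16 values out of 256. So the square root is still a reasonable rough estimate.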
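The simulated rate can be compared with the exact probability. The sketch below again assumes uniformly random hash values; collision_probability and samples_for_half are helper names made up for this illustration.

```python
def collision_probability(samples, space_size):
    """Exact chance of at least one collision among `samples` uniform values."""
    if samples > space_size:
        return 1.0
    p_all_distinct = 1.0
    for i in range(samples):
        p_all_distinct *= (space_size - i) / space_size
    return 1.0 - p_all_distinct


def samples_for_half(space_size):
    """Smallest number of values with at least a 50% collision chance."""
    samples = 1
    while collision_probability(samples, space_size) < 0.5:
        samples += 1
    return samples


p_16 = collision_probability(16, 256)
needed = samples_for_half(256)
print("Chance of collision with 16 values:", p_16)
print("Values needed for a 50% chance:", needed)
```

For the 8-bit MD5′ the square root undershoots the 50% mark by a few values, which is why the simulation lands under 50%.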

Step 3: Use a correct data structure to lookup in

Just to clarify: we will not find collisions on the full MD5 hash function, but we will check whether the collision estimate is reasonable.

This requires a lot of calculations, and we want to make sure that the wrong data structure does not become a bottleneck.

The Python dict is a hash table with expected O(1) insert and lookup. Still, the worst case is O(n) for these operations, which would be a big overhead to carry along the way. Hence, we will first test that the dictionary has O(1) insert and lookup time for the use cases we have here.

import time
import matplotlib.pyplot as plt



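def dict_size(size):
    # Time `size` lookups and inserts into a fresh dictionary.
    start = time.time()
    table = {}  # named `table` to avoid shadowing the built-in dict
    for i in range(size):
        if i in table:
            print("HIT")
        else:
            table[i] = 0

    return time.time() - start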


x = []
y = []
for i in range(0, 2**20, 2**12):
    performance = dict_size(i)
    x.append(i)
    y.append(performance)

plt.scatter(x, y, alpha=0.1)
plt.xlabel("Size")
plt.ylabel("Time (sec)")
plt.show()

Resulting in something like this.

What does that tell us? That the total time grows approximately linearly with the number of entries, which means each insert and lookup in a Python dict takes approximately constant time, that is O(1). But there is some overhead at some sizes, e.g. a bit before 3,000,000. It is not exactly linear, but close enough not to expect an exponential run time.

This step is not necessary, but it is nice to know how the run time grows when we want to check for collisions. If the time grew exponentially (or otherwise faster than linearly), it could suddenly become hard to estimate the run time for a bigger space.

Step 4: Validating if the square root of the space size is a good estimate for collision

We will continue our journey with our modified MD5′ hash function, where the output space will be reduced.

We will then, for various output space sizes, see if the estimate for a 50% collision chance of the hash function is decent. That is, whether we need approximately sqrt(space_size) hash values to have a roughly 50% chance of a collision.

This can be done by the following code.

import hashlib
import os
import time
import matplotlib.pyplot as plt


def main(bit_range):
    start = time.time()
    collision_count = 0
    # Each hex character of the digest holds 4 bits, hence we keep
    # bit_range//4 characters of it.
    space_size = bit_range//4
    for _ in range(100):
        lookup_table = {}
        # Searching sqrt of the space for a collision:
        # sqrt(2**bit_range) = 2**(bit_range//2)
        for _ in range(2**(bit_range//2)):
            random_binary = os.urandom(16)
            result = hashlib.md5(random_binary).hexdigest()
            result = result[:space_size]
            if result in lookup_table:
                collision_count += 1
                break
            else:
                lookup_table[result] = random_binary

    return time.time() - start, collision_count


x = []
y1 = []
y2 = []
for i in range(4, 44, 4):
    performance, count = main(i)
    x.append(i)
    y1.append(performance)
    y2.append(count)

_, ax1 = plt.subplots()
plt.xlabel("Size")
plt.ylabel("Time (sec)")
ax1.scatter(x, y1)
ax2 = ax1.twinx()
ax2.bar(x, y2, align='center', alpha=0.5, color='red')
ax2.set_ylabel("Collision rate (%)", color='red')
ax2.set_ylim([0, 100])

plt.show()

The estimated collision rate is very rough, as it only runs 100 trials for each space size.

The results are shown in the graph below.

Interestingly, it seems to be in the 30-50% range for most cases.
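That range is roughly what the standard birthday approximation predicts. The sketch below uses the common approximation 1 - exp(-k(k-1)/(2N)) for k values in a space of size N (an approximation, not the exact formula), with the same bit sizes as the experiment above. The function name is made up for this example.

```python
import math


def approx_collision_probability(samples, space_size):
    """Birthday approximation: 1 - exp(-k(k-1) / (2N))."""
    exponent = -samples * (samples - 1) / (2 * space_size)
    # -expm1(x) equals 1 - exp(x), computed in a numerically stable way.
    return -math.expm1(exponent)


predicted = {}
for bits in range(4, 44, 4):
    samples = 2 ** (bits // 2)  # sqrt of the space size
    predicted[bits] = approx_collision_probability(samples, 2 ** bits)
    print(bits, "bits:", f"{100 * predicted[bits]:.1f}%")
```

The predicted rate settles just under 40% as the space grows, so with only 100 trials per size, results scattered between 30% and 50% are what one would expect.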

As a note, it might be confusing that the run-time (the dots) does not seem to be linear. That is because each bit we add doubles the space. Hence, the x-axis is a logarithmic scale.

Step 5: What does that all mean?

This has high impact on using hash functions for creating unique identifiers. If you want a short identifier with the least number of bits, then you need to consider the Birthday Paradox.

Assume you created the following service.

import hashlib
import base64


def get_uid(text):
    result = hashlib.md5(text.encode()).digest()
    result = base64.b64encode(result)
    return result[:4]


uid = get_uid("my text")
print(uid)

If the input text can be considered random, how resistant is the get_uid(…) function against collisions?

Well, it returns 4 base64 characters. That is 6*4 = 24 bits of information (each base64 character contains 6 bits of information). The rough estimate is that if you use it sqrt(2^24) = 2^12 = 4,096 times you will have a high risk of collision (approximately 50% chance).

Let’s try.

import hashlib
import os
import base64


def get_uid(text):
    result = hashlib.md5(text).digest()
    result = base64.b64encode(result)
    return result[:4]


lookup_table = {}
for _ in range(4096):
    text = os.urandom(16)
    uid = get_uid(text)
    if uid in lookup_table:
        print("Collision detected")
    else:
        lookup_table[uid] = text

It does not give a collision every time, but run it a few times and you will get one.

Collision detected

Hence, it seems to be valid. The above code was run 1000 times and gave collision 497 times, which is close to 50% of the time.
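The same reasoning can be turned around to choose an identifier length. The following is only a sketch under the birthday approximation p ≈ 1 - exp(-n^2 / (2N)), assuming uniformly random hash output. The helpers bits_needed and base64_chars_needed are hypothetical names introduced for this example, not part of any library.

```python
import math


def bits_needed(num_ids, max_collision_probability):
    """Approximate number of hash bits so that `num_ids` identifiers
    collide with probability at most `max_collision_probability`."""
    # Solve p = 1 - exp(-n^2 / (2N)) for the space size N.
    space_size = num_ids ** 2 / (-2 * math.log1p(-max_collision_probability))
    return math.ceil(math.log2(space_size))


def base64_chars_needed(bits):
    """Each base64 character carries 6 bits."""
    return math.ceil(bits / 6)


# The 4-character uid above: 4,096 ids at a 50% collision risk.
print(bits_needed(4096, 0.5))
# A stricter, made-up requirement: one million ids, one-in-a-million risk.
strict_bits = bits_needed(10**6, 1e-6)
print(strict_bits, base64_chars_needed(strict_bits))
```

The takeaway is that the number of bits grows with twice the logarithm of the number of identifiers, so short identifiers are only safe for small collections.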
