    Birthday Paradox and Hash Function Collisions by Example

    What will we cover in this tutorial?

    We will look at how the Birthday Paradox is used to estimate how collision resistant a hash function is. This tutorial will show that a good estimate is that an n-bit hash function will produce a collision by chance after about 2^(n/2) random hash values.

    Step 1: Understand a hash function

    A cryptographic hash function is a one-way function with a fixed output size. That is, the output always has the same size, and it should be difficult to find two distinct input chunks that give the same output.

    A hash function is any function that can be used to map data of arbitrary size to fixed-size values.

    https://en.wikipedia.org/wiki/Hash_function
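    The fixed output size is easy to observe with Python's hashlib. The following is a minimal sketch; the sample inputs are chosen only for illustration.

```python
import hashlib

# Inputs of very different sizes (chosen only for illustration).
inputs = [b"", b"a", b"hello world", b"x" * 1_000_000]

for data in inputs:
    digest = hashlib.md5(data).digest()
    # MD5 always yields 16 bytes (128 bits), whatever the input size.
    print(len(data), len(digest), digest.hex())

# Collect the distinct digest lengths seen; there should be exactly one.
digest_sizes = {len(hashlib.md5(data).digest()) for data in inputs}
print(digest_sizes)
```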

    Probably the best known example of a hash function is MD5. It was designed to be used as a cryptographic hash function, but has been found to have many vulnerabilities.

    Does this mean you should not use the MD5 hash function?

    That depends. If you use it in a cryptographic setup, the answer is: do not use it.

    On the other hand, hash functions are often used to calculate identifiers. Whether MD5 is suitable for that purpose depends on the use case.

    This is where the Birthday Paradox comes in.

    Step 2: How are hash functions and the Birthday Paradox related?

    Good question. First recall what the Birthday Paradox states.

    …in a random group of 23 people, there is about a 50 percent chance that two people have the same birthday

    http://www.learnpythonwithrune.org/birthday-paradox-by-example-it-is-not-a-paradox/

    How can that be related to hash functions? There is something about collisions, right?

    Given 23 people, we have a 50% chance of a collision (two people with the same birthday).

    Hence, suppose our hash function maps data to a day in the calendar year. That is, it maps hash(data) -> [0, 364]. Then, given 23 hash values, we have a 50% chance of a collision.
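    The 50% figure can be checked directly. The sketch below assumes the values are uniformly distributed over the 365 days; the function name birthday_probability is made up for this example.

```python
def birthday_probability(num_values, space_size):
    """Exact chance that at least two of num_values uniform draws coincide."""
    if num_values > space_size:
        return 1.0  # pigeonhole principle: a collision is unavoidable
    no_collision = 1.0
    for i in range(num_values):
        # The (i+1)-th value must avoid the i values drawn before it.
        no_collision *= (space_size - i) / space_size
    return 1.0 - no_collision

p_23 = birthday_probability(23, 365)
print(f"{p_23:.4f}")  # just above 0.5
```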

    But you also know that a real hash function maps to far more than 365 distinct values. MD5, for instance, maps to 2^128 distinct values.

    An example would be appreciated now. Let us make a simplified hash function, call it MD5′ (md5-prime), which maps like the MD5, but only uses the first byte of the result.

    That is, we have MD5′(data) -> [0, 255].
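    As a sketch, MD5′ could be written as a small helper. The name md5_prime is hypothetical, and this version returns the first byte as an integer in [0, 255], whereas the code in this tutorial keeps it as a one-byte bytes object.

```python
import hashlib

def md5_prime(data: bytes) -> int:
    """Illustrative MD5': the first byte of the MD5 digest, as an int in 0..255."""
    # Indexing a bytes object yields an int, so no conversion is needed.
    return hashlib.md5(data).digest()[0]

value = md5_prime(b"some data")
print(value)
```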

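    Surely, by the pigeonhole principle, we run out of possible values at some point: 257 distinct inputs to MD5′ are guaranteed to contain a collision. With 256 random inputs a collision is not strictly guaranteed, but it is practically certain.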

    import hashlib
    import os
    
    lookup_table = {}
    collision_count = 0
    for _ in range(256):
        random_binary = os.urandom(16)
        result = hashlib.md5(random_binary).digest()
        result = result[:1]
        if result in lookup_table:
            print("Collision")
            print(random_binary, result)
            print(lookup_table[result], result)
            collision_count += 1
        else:
            lookup_table[result] = random_binary
    print("Number of collisions:", collision_count)
    

    The lookup_table stores the hash values we have already seen. We iterate 256 times (the number of possible values of our MD5′ hash function). In each iteration we take some random data, hash it with MD5 and keep only the first byte (8 bits). If the result already exists in lookup_table we have a collision; otherwise we add it to lookup_table.

    For a random run of this I got 87 collisions. Expected? I would say so.
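    Whether 87 is in the expected range can be estimated with a short calculation. The sketch below assumes the truncated hash values are uniformly distributed; expected_repeats is a name made up for this example. It counts collisions the same way as the loop above: every draw that hits an already seen value.

```python
def expected_repeats(num_draws, space_size):
    # Expected number of distinct values seen after num_draws uniform draws.
    expected_distinct = space_size * (1 - (1 - 1 / space_size) ** num_draws)
    # Every draw that does not produce a new value counts as a collision.
    return num_draws - expected_distinct

repeats = expected_repeats(256, 256)
print(round(repeats, 1))  # roughly 94, so 87 in a single run is plausible
```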

    Let us try to use the Birthday Paradox to estimate how many hash values we need to get a collision of our MD5′ hash function.

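    A rough estimate that is widely used is that the square root of the number of possible outcomes gives a chance of collision in the neighbourhood of 50% (see wikipedia for the approximation). More precisely, the approximation gives about 39% for sqrt(N) values out of N possible outcomes, and 50% for about 1.18*sqrt(N) values.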
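    The square-root rule follows from a standard approximation. Here N is the number of possible outcomes and k the number of random values drawn; the symbols are introduced only for this derivation.

```latex
\begin{align*}
P(\text{no collision})
  &= \prod_{i=0}^{k-1}\left(1-\frac{i}{N}\right)
   \approx \prod_{i=0}^{k-1} e^{-i/N}
   = e^{-k(k-1)/(2N)}
   \approx e^{-k^{2}/(2N)}, \\
P(\text{collision})
  &\approx 1 - e^{-k^{2}/(2N)}, \\
k = \sqrt{N}
  &\;\Rightarrow\; P(\text{collision}) \approx 1 - e^{-1/2} \approx 0.39, \\
P(\text{collision}) = \tfrac{1}{2}
  &\;\Rightarrow\; k \approx \sqrt{2\ln 2 \cdot N} \approx 1.18\sqrt{N}.
\end{align*}
```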

    That is, for MD5′(data) -> [0, 255] it is, sqrt(256) = 16. Let’s try that.

    import hashlib
    import os
    
    collision = 0
    for _ in range(1000):
        lookup_table = {}
        for _ in range(16):
            random_binary = os.urandom(16)
            result = hashlib.md5(random_binary).digest()
            result = result[:1]
            if result not in lookup_table:
                lookup_table[result] = random_binary
            else:
                collision += 1
                break
    print("Number of collisions:", collision, "out of", 1000)
    

    Which gives something like this.

    Number of collisions: 391 out of 1000
    

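    That is below 50%, but in line with the more precise approximation 1 - e^(-1/2), which is about 39% for sqrt(N) values. The square root is still a reasonable rule of thumb.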
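    To see where the 50% mark falls for a space of 256 values, the exact probability can be computed and searched. This is a minimal sketch assuming uniformly distributed hash values; both function names are illustrative.

```python
def collision_probability(k, space_size):
    """Exact chance of at least one collision among k uniform draws."""
    no_collision = 1.0
    for i in range(k):
        no_collision *= (space_size - i) / space_size
    return 1.0 - no_collision

def values_needed(space_size, target=0.5):
    """Smallest number of draws whose collision chance reaches target."""
    k = 1
    while collision_probability(k, space_size) < target:
        k += 1
    return k

p_16 = collision_probability(16, 256)
print(f"{p_16:.3f}")       # chance of a collision with 16 values
print(values_needed(256))  # draws needed to reach 50%
```

    With 16 values the exact chance comes out a little under 40%, and it takes 20 values to pass 50%.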

    Step 3: Use a correct data structure to lookup in

    Just to clarify: we will not find collisions on the full MD5 hash function, but we will try to see if the estimate of collision is reasonable.

    This requires a lot of calculations, and we want to make sure that a wrong data structure does not become a bottleneck.

    The Python dict is a hash table with expected O(1) insert and lookup. Still, the worst case is O(n) for these operations, which would be a big overhead to carry along the way. Hence, we will first test that the dictionary has O(1) insert and lookup time for the use cases we have here.

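    import time
    import matplotlib.pyplot as plt
    
    def dict_size(size):
        # Time `size` lookups and inserts into a fresh dictionary.
        # perf_counter() is a monotonic clock intended for measuring durations.
        start = time.perf_counter()
        table = {}  # avoid the name `dict`, which would shadow the built-in
        for i in range(size):
            if i in table:
                print("HIT")
            else:
                table[i] = 0
        return time.perf_counter() - start
    
    x = []
    y = []
    for i in range(0, 2**20, 2**12):
        performance = dict_size(i)
        x.append(i)
        y.append(performance)
    plt.scatter(x, y, alpha=0.1)
    plt.xlabel("Size")
    plt.ylabel("Time (sec)")
    plt.show()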
    

    Resulting in something like this.

    What does that tell us? The total time grows approximately linearly with the number of entries, which means that each insert and lookup takes roughly constant time, that is O(1). But there is some overhead at some sizes, e.g. a bit before 3,000,000. It is not exactly linear, but close enough that we do not expect an exponential run time.

    This step is not necessary, but it is nice to know how the run time grows when we want to check for collisions. If the total time grew exponentially (or in any other non-linear way), it could suddenly become hard to estimate the run time for a bigger space.

    Step 4: Validating if the square root of the space size is a good estimate for collision

    We will continue our journey with our modified MD5′ hash function, where the output space will be reduced.

    We will then check, for various output space sizes, whether the estimate for a 50% chance of collision is decent. That is, whether approximately sqrt(space_size) hash values give an approximately 50% chance of a collision.
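    One way to reduce the output space is to keep only the leading bits of the digest. The helper below is a hypothetical sketch (truncated_md5 is not part of hashlib); for bit sizes that are multiples of 4 it keeps the same bits as taking a prefix of the hex digest.

```python
import hashlib

def truncated_md5(data: bytes, bits: int) -> int:
    """Keep only the leading `bits` bits of the 128-bit MD5 digest."""
    if not 1 <= bits <= 128:
        raise ValueError("bits must be between 1 and 128")
    full = int.from_bytes(hashlib.md5(data).digest(), "big")
    # Shift away the low-order bits so that only the top `bits` remain.
    return full >> (128 - bits)

print(truncated_md5(b"my text", 8))   # a value in [0, 255]
print(truncated_md5(b"my text", 12))  # a value in [0, 4095]
```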

    This can be done by the following code.

    import hashlib
    import os
    import time
    import matplotlib.pyplot as plt
    
    def main(bit_range):
        start = time.time()
        collision_count = 0
        # Each hex character of the digest encodes 4 bits, hence we keep
        # bit_range//4 hex characters
        space_size = bit_range//4
        for _ in range(100):
            lookup_table = {}
            # Searching the sqrt of the space for collision
            # sqrt(2**bit_range) = 2**(bit_range//2)
            for _ in range(2**(bit_range//2)):
                random_binary = os.urandom(16)
                result = hashlib.md5(random_binary).hexdigest()
                result = result[:space_size]
                if result in lookup_table:
                    collision_count += 1
                    break
                else:
                    lookup_table[result] = random_binary
        return time.time() - start, collision_count
    
    x = []
    y1 = []
    y2 = []
    for i in range(4, 44, 4):
        performance, count = main(i)
        x.append(i)
        y1.append(performance)
        y2.append(count)
    _, ax1 = plt.subplots()
    plt.xlabel("Size")
    plt.ylabel("Time (sec)")
    ax1.scatter(x, y1)
    ax2 = ax1.twinx()
    ax2.bar(x, y2, align='center', alpha=0.5, color='red')
    ax2.set_ylabel("Collision rate (%)", color='red')
    ax2.set_ylim([0, 100])
    plt.show()
    

    The estimated collision rate is very rough, as it only runs 100 trials for each space size.

    The results are shown in the graph below.

    Interestingly, it seems to be in the 30-50% range for most cases.

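    As a note, it might be confusing that the run time (the dots) does not seem to be linear. That is because the x-axis shows the number of bits, which is a logarithmic scale of the space size. Every extra bit doubles the space, so each 4-bit step multiplies the space by 16 and the number of hash values tried per trial, 2^(bits/2), by 4.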
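    The growth in work can be tabulated for the bit sizes used in the experiment above. A small sketch; hashes_per_trial is an illustrative name for the inner loop bound 2**(bit_range//2).

```python
def hashes_per_trial(bits):
    # Upper bound on hash values computed in one trial: sqrt(2**bits).
    return 2 ** (bits // 2)

for bits in range(4, 44, 4):
    # Columns: bit size, size of the output space, hash values per trial.
    print(bits, 2 ** bits, hashes_per_trial(bits))
```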

    Step 5: What does that all mean?

    This has a high impact on using hash functions for creating unique identifiers. If you want a short identifier with the least number of bits, then you need to consider the Birthday Paradox.
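    The same reasoning can be turned around to size an identifier. The sketch below assumes uniformly distributed identifiers and uses the approximation P(collision) ≈ 1 - exp(-k(k-1)/(2N)); the function name required_bits and the example numbers are illustrative.

```python
import math

def required_bits(num_ids, max_collision_probability):
    """Approximate bits needed so num_ids random ids collide with at most the given chance."""
    if num_ids < 2:
        return 0  # a single identifier cannot collide
    if not 0 < max_collision_probability < 1:
        raise ValueError("probability must be strictly between 0 and 1")
    pairs = num_ids * (num_ids - 1) / 2
    # Solve 1 - exp(-pairs / N) = p for the space size N.
    space_size = pairs / -math.log1p(-max_collision_probability)
    return math.ceil(math.log2(space_size))

print(required_bits(4096, 0.5))     # ids that may collide about half the time
print(required_bits(10**6, 1e-6))   # a million ids, one-in-a-million risk
```

    Under this approximation, 4,096 identifiers with an accepted 50% risk need 24 bits, while a million identifiers with a one-in-a-million risk need 59 bits.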

    Assume you created the following service.

    import hashlib
    import base64
    
    def get_uid(text):
        result = hashlib.md5(text.encode()).digest()
        result = base64.b64encode(result)
        return result[:4]
    
    uid = get_uid("my text")
    print(uid)
    

    If the input text can be considered random, how resistant is the get_uid(…) function against collisions?

    Well, it returns 4 base64 characters. That is 6*4 = 24 bits of information (each base64 character contains 6 bits of information). The rough estimate is that if you use it sqrt(2^24) = 2^12 = 4,096 times you will have a high risk of collision (in the region of 40 to 50% chance).

    Let’s try.

    import hashlib
    import os
    import base64
    
    def get_uid(text):
        result = hashlib.md5(text).digest()
        result = base64.b64encode(result)
        return result[:4]
    
    lookup_table = {}
    for _ in range(4096):
        text = os.urandom(16)
        uid = get_uid(text)
        if uid in lookup_table:
            print("Collision detected")
        else:
            lookup_table[uid] = text
    

    It does not give a collision every time, but run it a few times and you will get one.

    Collision detected
    

    Hence, it seems to be valid. The above code was run 1000 times and gave a collision 497 times, which is close to 50% of the time.
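    Repeating the experiment by hand is tedious, so the runs can be wrapped in a loop. This is one possible harness, not necessarily the one used for the numbers above; run_has_collision and count_runs_with_collision are names made up for this sketch, and the result varies from run to run.

```python
import base64
import hashlib
import os

def get_uid(data: bytes) -> bytes:
    digest = hashlib.md5(data).digest()
    return base64.b64encode(digest)[:4]

def run_has_collision(num_ids=4096):
    """Return True if num_ids random inputs yield at least one repeated uid."""
    seen = set()
    for _ in range(num_ids):
        uid = get_uid(os.urandom(16))
        if uid in seen:
            return True  # one collision is enough for this run
        seen.add(uid)
    return False

def count_runs_with_collision(trials=1000, num_ids=4096):
    """Count how many independent runs contain at least one collision."""
    return sum(run_has_collision(num_ids) for _ in range(trials))

# A smaller number of trials keeps the example quick to run.
print(count_runs_with_collision(trials=200), "of 200 runs had a collision")
```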
