
I am crawling some websites for special items and storing them in a MongoDB server. To avoid duplicate items, I am using the hash value of the item link. Here is my code to generate the hash from the link:

import hashlib
from bson.objectid import ObjectId

def gen_objectid(link):
    """Generates objectid from given link"""
    return ObjectId(hashlib.shake_128(str(link).encode('utf-8')).digest(12))
# end def
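For context, the call is deterministic — hashing the same link always yields the same 12-byte digest, which is what makes it usable as a stable `_id` for deduplication (the URL `https://example.com/item/42` below is just a placeholder):

```python
import hashlib

link = "https://example.com/item/42"

# Same input -> same 12-byte digest, every time.
d1 = hashlib.shake_128(link.encode("utf-8")).digest(12)
d2 = hashlib.shake_128(link.encode("utf-8")).digest(12)
assert d1 == d2

# 12 bytes is exactly the size ObjectId expects; its 24-char hex
# form is also a valid argument to ObjectId(...).
print(len(d1), d1.hex())
```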

I have no idea how the shake_128 algorithm works. That is where my question comes in.

Is it okay to use this method? Can I safely assume that the probability of a collision is negligible?

What is the better way to do this?

Dipu
    All of the hashlib hash functions are cryptographic, so they are resistant to random collisions (and Shake should be more resistant to non-random collisions than MD5 or SHA-1). 12 bytes gives you `2**96` different hashes. According to [this Wikipedia probability table relating to "birthday attacks"](https://en.wikipedia.org/wiki/Birthday_problem#Probability_table) the odds of a single collision on `8.9×10**48` records with a 96 bit hash are around `10**-18`. I think that should be adequate. ;) – PM 2Ring Oct 29 '17 at 17:12
  • The number of my crawled items should not exceed `10^8`. Your comment gives me peace of mind :) – Dipu Oct 29 '17 at 17:31
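The collision risk mentioned in the comments can be checked with the standard birthday-bound approximation. This is a rough sketch, assuming ~10^8 items hashed to 96 bits (12 bytes), using p ≈ 1 − exp(−n(n−1)/2^(bits+1)):

```python
import math

def collision_probability(n, bits):
    """Approximate probability of at least one collision among n
    uniformly random values drawn from 2**bits possible outputs
    (birthday-bound approximation, accurate when p is small)."""
    return 1 - math.exp(-n * (n - 1) / 2 ** (bits + 1))

# ~1e8 crawled links, 96-bit (12-byte) hashes:
p = collision_probability(10**8, 96)
print(f"{p:.2e}")  # roughly 6e-14 -- negligible in practice
```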

1 Answer


shake_128 is part of the SHA-3 family, chosen as the result of a contest to select the next generation of secure hash algorithms. It is not yet widely used, since SHA-2 is still considered good enough in most cases. Since these algorithms are designed for cryptographically secure hashing, they are overkill for what you are doing. One correction: the "128" in shake_128 refers to its security strength, not a fixed output size. SHAKE-128 is an extendable-output function (XOF), so `digest(12)` really does return 12 bytes, i.e. 96 bits. That gives you 2^96 ≈ 7.9e28 different hashes, which is still vastly more than enough for your use case. If anything, I would say you could use a faster non-cryptographic hashing algorithm, since you don't need cryptographic security here.
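The variable-length behavior is easy to verify in the standard library: `digest()` takes the desired length in bytes, and shorter outputs are prefixes of longer ones, as expected for an XOF (the input bytes below are arbitrary):

```python
import hashlib

h = hashlib.shake_128(b"https://example.com/item/42")

# digest(n) returns exactly n bytes -- no fixed 16-byte output.
print(len(h.digest(12)))  # 12
print(len(h.digest(16)))  # 16

# The 12-byte output is a prefix of the 16-byte output,
# since SHAKE squeezes a single output stream.
assert h.digest(16)[:12] == h.digest(12)
```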

HackerBoss