
I am crawling some websites for special items and storing them in a MongoDB server. To avoid duplicate items, I am using the hash value of the item link. Here is my code to generate the hash from the link:

import hashlib
from bson.objectid import ObjectId

def gen_objectid(link):
    """Generates objectid from given link"""
    return ObjectId(hashlib.shake_128(str(link).encode('utf-8')).digest(12))
# end def
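For context, the call is deterministic — hashing the same link always yields the same 12-byte digest, which is what makes it usable as a stable `_id` for deduplication (the URL `https://example.com/item/42` below is just a placeholder):

```python
import hashlib

link = "https://example.com/item/42"

# Same input -> same 12-byte digest, every time.
d1 = hashlib.shake_128(link.encode("utf-8")).digest(12)
d2 = hashlib.shake_128(link.encode("utf-8")).digest(12)
assert d1 == d2

# 12 bytes is exactly the size ObjectId expects; its 24-char hex
# form is also a valid argument to ObjectId(...).
print(len(d1), d1.hex())
```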

I have no idea how the shake_128 algorithm works. That is where my question comes in.

Is it okay to use this method? Can I safely assume that the probability of a collision is negligible?

What is the better way to do this?

Dipu
    All of the hashlib hash functions are cryptographic, so they are resistant to random collisions (and Shake should be more resistant to non-random collisions than MD5 or SHA-1). 12 bytes gives you `2**96` different hashes. According to [this Wikipedia probability table relating to "birthday attacks"](https://en.wikipedia.org/wiki/Birthday_problem#Probability_table) the odds of a single collision on `8.9×10**48` records with a 96 bit hash are around `10**-18`. I think that should be adequate. ;) – PM 2Ring Oct 29 '17 at 17:12
  • The number of my crawled items should not exceed `10^8`. Your comment gives me peace of mind :) – Dipu Oct 29 '17 at 17:31
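The collision risk mentioned in the comments can be checked with the standard birthday-bound approximation. This is a rough sketch, assuming ~10^8 items hashed to 96 bits (12 bytes), using p ≈ 1 − exp(−n(n−1)/2^(bits+1)):

```python
import math

def collision_probability(n, bits):
    """Approximate probability of at least one collision among n
    uniformly random values drawn from 2**bits possible outputs
    (birthday-bound approximation, accurate when p is small)."""
    return 1 - math.exp(-n * (n - 1) / 2 ** (bits + 1))

# ~1e8 crawled links, 96-bit (12-byte) hashes:
p = collision_probability(10**8, 96)
print(f"{p:.2e}")  # roughly 6e-14 -- negligible in practice
```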

1 Answer


shake_128 is part of the SHA-3 family, chosen as the result of a contest to select the next generation of secure hash algorithms. It is not yet widely used, since SHA-2 is still considered good enough in most cases. Since these algorithms are designed for cryptographically secure hashing, they are overkill for what you are doing. One correction: the "128" in shake_128 refers to its security strength, not a fixed output size. SHAKE-128 is an extendable-output function (XOF), so `digest(12)` really does return 12 bytes, i.e. 96 bits. That gives you 2^96 ≈ 7.9e28 different hashes, which is still vastly more than enough for your use case. If anything, I would say you could use a faster non-cryptographic hashing algorithm, since you don't need cryptographic security here.
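The variable-length behavior is easy to verify in the standard library: `digest()` takes the desired length in bytes, and shorter outputs are prefixes of longer ones, as expected for an XOF (the input bytes below are arbitrary):

```python
import hashlib

h = hashlib.shake_128(b"https://example.com/item/42")

# digest(n) returns exactly n bytes -- no fixed 16-byte output.
print(len(h.digest(12)))  # 12
print(len(h.digest(16)))  # 16

# The 12-byte output is a prefix of the 16-byte output,
# since SHAKE squeezes a single output stream.
assert h.digest(16)[:12] == h.digest(12)
```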

HackerBoss