I am crawling some websites for special items and storing them in MongoDB server. To avoid duplicate items, I am using the hash value of the item link. Here is my code to generate the hash from the link:
import hashlib
from bson.objectid import ObjectId
def gen_objectid(link):
"""Generates objectid from given link"""
return ObjectId(hashlib.shake_128(str(link).encode('utf-8')).digest(12))
# end def
I have no idea how the shake_128
algorithm works. That is where my question comes in.
Is it okay to use this method? Can I safely assume that the probability of a collision is negligible?
What is the better way to do this?