Is there any way to implement hashing without collisions in Python 3?

I am using the mmh3 package (MurmurHash3):

import mmh3
string = "/hjhfkhdf/jefhfueiow-/eflkjhfeiero-kk&/kerdfujelifjr(0kjlegjfejf/?/jdfkhe"
mmh3.hash128(string)

To avoid collisions, I am adding a salt (seed). Is that good enough to guarantee uniqueness?

import mmh3
string = "/hjhfkhdf/jefhfueiow-/eflkjhfeiero-kk&/kerdfujelifjr(0kjlegjfejf/?/jdfkhe"
mmh3.hash128(string, 12, signed=True)

Purpose:

My strings are long and contain special characters. I need to store them in a database and index them. I am assuming (and may be wrong) that querying with special characters leads to issues, which is why I generate a hash and store that in the DB instead. If my assumption is wrong, I can store the original value.
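A minimal sketch of the "store the original value" option, assuming SQLite through Python's built-in sqlite3 module. The table name `paths`, its columns, and the helper functions are hypothetical, and SHA-256 from hashlib stands in for the hash here. With `?` placeholders the string is passed as data rather than spliced into the SQL text, so special characters do not need any escaping:

```python
import hashlib
import sqlite3

# Hypothetical schema: the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paths (digest TEXT PRIMARY KEY, original TEXT NOT NULL)"
)


def digest_of(text: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def store(text: str) -> str:
    """Insert the string; the ? placeholders pass it as data, not as SQL."""
    key = digest_of(text)
    conn.execute(
        "INSERT OR IGNORE INTO paths (digest, original) VALUES (?, ?)",
        (key, text),
    )
    return key


def lookup(text: str):
    """Find a row by digest, then compare the original to rule out a collision."""
    row = conn.execute(
        "SELECT original FROM paths WHERE digest = ?", (digest_of(text),)
    ).fetchone()
    return row[0] if row is not None and row[0] == text else None


value = "/hjhfkhdf/jefhfueiow-/eflkjhfeiero-kk&/kerdfujelifjr(0kjlegjfejf/?/jdfkhe"
key = store(value)
found = lookup(value)
```

Keeping the original next to the fixed-length digest means the index stays compact, while the final string comparison in `lookup` catches the (very unlikely) case of two different strings sharing a digest.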

Vidya
  • It will never be completely unique, that is inherent to hashing. The chance of a collision will go down with larger hash sizes but will never reach zero. You have to define a reasonable cutoff value based on the number of hashes you will create. – Jan Wilamowski Jan 26 '22 at 02:46
  • @JanWilamowski, could you please explain more about defining a cutoff value? In my case, each supplied string is itself unique. – Vidya Jan 26 '22 at 02:49
  • If your data is unique, why use hashing in the first place? Also, in the question you wrote that your data is "almost unique". If you have duplicates in your inputs then the hash function will generate the same values for them. You could simply append a random value to each entry. – Jan Wilamowski Jan 26 '22 at 02:53
  • @Vidya. How many unique strings do you have? Let's say that number is N. Clearly you will need log2(N) bits to make a lookup table for those strings. Hashes are just LUT keys most of the time. – Mad Physicist Jan 26 '22 at 02:53
  • @JanWilamowski sorry for the confusion; the data is always unique. I am updating the DB and indexing it. The values contain special characters and the strings are big. To save storage and get a better query experience, I am converting them to hashes and storing those, but I expect the DB updates not to produce colliding hashes. – Vidya Jan 26 '22 at 02:59
  • @MadPhysicist, all string inputs are unique – Vidya Jan 26 '22 at 02:59
  • You have chosen MurmurHash. It is a non-cryptographic hash algorithm and known for a weakness: collisions can be constructed. To prevent collisions to a degree where they are next to impossible, use a cryptographic hash with many bits, e.g. SHA512. – Klaus D. Jan 26 '22 at 03:00
  • @Vidya. How many are there? – Mad Physicist Jan 26 '22 at 03:50
  • @MadPhysicist, I did not get you. All processed input strings are unique. – Vidya Jan 26 '22 at 03:55
  • @Vidya. How many unique strings will you have? That's important because, say you allow 100-character strings with 26 letters. That's 3x10^141 possibilities, and that's without going into the possibility of shorter strings. That many strings will take just over 470 bits to encode uniquely in a hash, or you will have collisions. That's all there is to it: if you have N items, you need log2(N) bits to encode them uniquely. – Mad Physicist Jan 26 '22 at 04:02
  • @MadPhysicist I need to hash up to 100 million strings, and the length of each string varies from 5 to 512 characters. – Vidya Jan 26 '22 at 04:23
  • @Vidya. log2(100,000,000) is only ~26.5, so you need "only" 27 (or so, depending on how many hundreds) bits. So at least in theory, a perfect collision-free hash is possible for your case. Making one that actually works is a bit more challenging than positing its existence though. – Mad Physicist Jan 26 '22 at 04:45
  • @MadPhysicist, can my solution also be extended to 1,000 million strings? – Vidya Jan 26 '22 at 04:47
  • @Vidya. I'm sure you don't need me to run log2(N) for you. – Mad Physicist Jan 26 '22 at 06:19
  • @MadPhysicist, it comes to 29.x, so 30 bits. Can we extend up to a maximum of 470 bits? As per the theory, up to 470 bits won't generate any collisions. Is my understanding correct? – Vidya Jan 26 '22 at 06:25
  • @Vidya. Sure, the number of possible strings gives you the number of bits theoretically necessary to hold that many distinct hashes. But good luck trying to make a hash that can uniquely map that many arbitrary input strings. – Mad Physicist Jan 26 '22 at 06:27
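The two numbers discussed in the comments can be sketched in a few lines of Python. `min_bits` is the log2(N) lower bound that only a purpose-built perfect hash could reach, while `collision_probability` is the standard birthday-bound approximation for a general hash of a given width. Both function names are hypothetical, and the estimate assumes hash outputs are uniformly distributed, which does not hold against inputs deliberately constructed to collide:

```python
import math


def min_bits(n_items: int) -> int:
    """Smallest bit width with at least n_items distinct values: ceil(log2(n))."""
    return max(1, (n_items - 1).bit_length())


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound estimate: 1 - exp(-n * (n - 1) / 2**(hash_bits + 1)).

    Assumes every hash value is equally likely (an idealised hash function).
    """
    exponent = -n_items * (n_items - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


for n in (10**8, 10**9):
    print(f"n = {n}: at least {min_bits(n)} bits for a perfect hash")
    for bits in (32, 64, 128):
        print(f"  {bits}-bit hash: P(collision) ~ {collision_probability(n, bits):.3g}")
```

Under that uniformity assumption, 100 million strings are practically certain to collide in a 32-bit hash, while for a 128-bit hash the estimate is on the order of 1e-23. This is the gap between "27 bits suffice in theory" and the much wider hashes used in practice.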
