I am seeing odd behavior with Python's built-in `hash()` function.
I am writing hashes of strings (which come from various language/text datasets) into 900 different files, according to the first three digits of their hash values. I noticed that all files up to 921 have very similar sizes, i.e. a similar number of hashes written to them (as expected), but file 922 is only around a third of that size, and every file after 922 is only about 1/10 the size of the others. This happens for inherently different datasets (e.g. https://huggingface.co/datasets/embedding-data/Amazon-QA or the captions from https://cocodataset.org/#home), so it doesn't seem to be caused by the data. Shouldn't (random) strings map to uniformly distributed hashes? Could this be caused by `hash()` itself, i.e. is there something odd about the built-in hash function?

joinijo
-
Um, if you only use the first three digits of the hash values, you are discarding most of the information that distinguishes one hash from another. The input data could also be biased, you haven't outlined your full algorithm, and you haven't said whether you [disabled hash randomization](https://stackoverflow.com/questions/30585108/disable-hash-randomization-from-within-python-program). The claim you are making is rather outlandish, so a study like this needs to be rigorous to support it. – metatoaster May 19 '23 at 07:47
-
Moreover, the algorithm used is SipHash as per [PEP 456](https://peps.python.org/pep-0456/#siphash), and it's designed to be used with a randomly set key, to counter the weakness where specific sets of inputs hash to the same value. The point is that the full hash is different, and if you found a set of inputs that hash the same, it should generally only affect usage with that chosen key value anyway. – metatoaster May 19 '23 at 07:53
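
To make the randomly set key concrete, here is a minimal sketch that computes `hash()` of a string in a fresh interpreter with a chosen `PYTHONHASHSEED`. The helper name `string_hash_in_fresh_interpreter` is made up for illustration and assumes a CPython interpreter is available as `sys.executable`. The same seed should give the same value on every run, while different seeds will almost always give different values.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(text: str, seed: str) -> int:
    """Return hash(text) as computed by a new Python process.

    `seed` is passed as PYTHONHASHSEED: "random", or a decimal integer
    (the value "0" disables hash randomization).
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # Same key (seed) twice, then a different key.
    print(string_hash_in_fresh_interpreter("abc", "1"))
    print(string_hash_in_fresh_interpreter("abc", "1"))
    print(string_hash_in_fresh_interpreter("abc", "2"))
```

If the bucket files have to be comparable across runs, fixing the seed this way (or via the environment before starting Python) is one option.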
-
The problem is that hashes aren't all the same length. Notice that the same first three digits appear in 999, 9990 through 9999, 99900 through 99999, and so on. Are you adding leading zeroes to get them all to the same length first? – Barmar May 19 '23 at 08:04
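
The length effect can be simulated without any dataset. The sketch below assumes the bucket is taken as `str(abs(h))[:3]` with no zero padding (the question does not show the actual code) and that hashes are spread uniformly over the signed 64-bit range, as on a typical 64-bit CPython build. Since `2**63` is roughly 9.22e18, 19-digit values can only start with 100 through 922; prefixes 923 through 999 can only come from shorter, roughly ten times rarer values. The function names are made up for illustration.

```python
import random
from collections import Counter


def first_three_digits(h: int) -> int:
    """Bucket by the first three decimal digits of |h|, with no zero padding."""
    return int(str(abs(h))[:3])


def simulate_leading_digit_counts(n_samples: int, seed: int = 0) -> Counter:
    """Count buckets for values drawn uniformly from [-2**63, 2**63)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        h = rng.randrange(-2**63, 2**63)  # stand-in for hash(some_string)
        counts[first_three_digits(h)] += 1
    return counts


if __name__ == "__main__":
    counts = simulate_leading_digit_counts(500_000)
    for bucket in (100, 500, 921, 922, 923, 999):
        print(bucket, counts[bucket])
```

Under these assumptions, buckets 100 through 921 come out roughly equal, bucket 922 is noticeably smaller, and buckets 923 through 999 hold about a tenth as many values, which is the pattern described in the question.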
-
A better way to distribute would probably be a modulus rather than the first N digits. – Barmar May 19 '23 at 08:05
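
A minimal sketch of the modulus approach, assuming 900 buckets to match the number of files in the question; `bucket_by_modulus` and `N_BUCKETS` are illustrative names, not part of any library. In Python, `%` with a positive divisor always returns a non-negative result, so negative hashes need no special handling.

```python
N_BUCKETS = 900  # assumption: one bucket per file, as in the question


def bucket_by_modulus(h: int, n_buckets: int = N_BUCKETS) -> int:
    """Map any integer hash (including negative ones) to 0..n_buckets-1."""
    return h % n_buckets


if __name__ == "__main__":
    # The actual values change between runs unless PYTHONHASHSEED is fixed.
    for text in ("first example", "second example"):
        print(text, bucket_by_modulus(hash(text)))
```

Because 900 is tiny compared with the 64-bit hash range, the leftover bias from the range not being an exact multiple of 900 is negligible.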