103

I've been playing with Python's hash function. For small integers, it appears hash(n) == n always. However this does not extend to large numbers:

>>> hash(2**100) == 2**100
False

I'm not surprised, I understand hash takes a finite range of values. What is that range?

I tried using binary search to find the smallest number n for which hash(n) != n:

>>> import codejamhelpers # pip install codejamhelpers
>>> help(codejamhelpers.binary_search)
Help on function binary_search in module codejamhelpers.binary_search:

binary_search(f, t)
    Given an increasing function :math:`f`, find the greatest non-negative integer :math:`n` such that :math:`f(n) \le t`. If :math:`f(n) > t` for all :math:`n \ge 0`, return None.

>>> f = lambda n: int(hash(n) != n)
>>> n = codejamhelpers.binary_search(f, 0)
>>> hash(n)
2305843009213693950
>>> hash(n+1)
0

What's special about 2305843009213693951? I note it's less than sys.maxsize == 9223372036854775807

Edit: I'm using Python 3. I ran the same binary search on Python 2 and got a different result 2147483648, which I note is sys.maxint+1

I also played with [hash(random.random()) for i in range(10**6)] to estimate the range of the hash function. The max is consistently below the n found above. Comparing the min, it seems Python 3's hash is always positive for these inputs, whereas Python 2's hash can take negative values.

Mazdak
Colonel Panic
  • 9
    Have you checked the number's binary representation? – John Dvorak Jun 03 '16 at 10:57
  • 3
    '0b1111111111111111111111111111111111111111111111111111111111111' curious! So `n+1 == 2**61-1` – Colonel Panic Jun 03 '16 at 10:58
  • 2
    seems to be system dependent. With my python, the hash is `n` for the whole 64bit int range. – Daniel Jun 03 '16 at 11:00
  • 1
    Note the stated purpose of the hash value: *They are used to quickly compare dictionary keys during a dictionary lookup.* In other words, implementation-defined, and by virtue of being shorter than many values that can have hash values, may very well have collisions even in reasonable input spaces. – user Jun 03 '16 at 13:48
  • Colonel Panic, I modified the name of this question to include "Python." This seems to have caught enough interest to show up on "Hot Network Questions," where you don't immediately see the Python tag that you applied. Feel free to revert the change if it was undesired. – Cort Ammon Jun 03 '16 at 15:58
  • 2
    Um, isn't `2147483647` equal to `sys.maxint` (not `sys.maxint+1`), and if 'n = 0b1111111111111111111111111111111111111111111111111111111111111' then isn't `n+1 == 2**61` or `n == 2**61-1` (not `n+1 == 2**61-1`)? – phoog Jun 03 '16 at 19:35
  • `-1` is an interesting one as `hash(-1) == -2`. Hint: the return value of `-1` is used to signal errors in CPython and thus "reserved". – rszalski Jun 10 '16 at 14:34

4 Answers

82

2305843009213693951 is 2^61 - 1. It's the largest Mersenne prime that fits into 64 bits.

If you have to make a hash just by taking the value mod some number, then a large Mersenne prime is a good choice -- it's easy to compute and ensures an even distribution of possibilities. (Although I personally would never make a hash this way)
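
The reduction described above can be checked directly. This is a minimal sketch assuming CPython 3, where `sys.hash_info.modulus` exposes the prime; the helper name `expected_hash` is made up for illustration and only covers non-negative integers.

```python
import sys

M = sys.hash_info.modulus  # 2**61 - 1 on typical 64-bit CPython builds

def expected_hash(n):
    """Predicted hash of a non-negative int: plain reduction modulo M."""
    return n % M

# Small values hash to themselves; values at or above M wrap around.
for n in (0, 1, 12345, M - 1, M, M + 1, 2**100):
    assert hash(n) == expected_hash(n)

print(hash(M - 1) == M - 1, hash(M))
```

This matches the binary search result in the question: the last integer that hashes to itself is M - 1, and hash(M) wraps to 0.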

It's especially convenient to compute the modulus for floating-point numbers. They have an exponent component that multiplies the whole number by 2^x. Since 2^61 = 1 (mod 2^61 - 1), you only need to consider the exponent mod 61.
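
The exponent trick can be observed on floats that are exact powers of two. The following is a sketch assuming CPython 3; `expected_pow2_hash` is a hypothetical helper, and the bit count is derived from `sys.hash_info.modulus` rather than hard-coded so it also holds on 32-bit builds.

```python
import sys

M = sys.hash_info.modulus
BITS = M.bit_length()  # 61 on typical 64-bit CPython builds

def expected_pow2_hash(k):
    """Predicted hash of 2.0**k for k >= 0: only k mod BITS matters."""
    return 2 ** (k % BITS)

# 2.0**k is exactly representable, so its hash equals 2**k reduced modulo M.
for k in (0, 1, 60, 61, 62, 100, 1000):
    assert hash(2.0 ** k) == expected_pow2_hash(k)
```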

See: https://en.wikipedia.org/wiki/Mersenne_prime

msanford
Matt Timmermans
  • 8
    You say you would never make a hash this way. Do you have alternative suggestions for how it could be done in a way that makes it reasonably efficient to compute for ints, floats, Decimals, Fractions _and_ ensures that `x == y` guarantees `hash(x) == hash(y)` across types? (Numbers like `Decimal('1e99999999')` are especially problematic, for example: you don't want to have to expand them out to the corresponding integer before hashing.) – Mark Dickinson Jun 03 '16 at 17:09
  • @MarkDickinson I suspect he's trying to draw a distinction between this simple lightning-fast hash, and cryptographic hashes that also care about making the output look random. – Mike Ounsworth Jun 03 '16 at 21:43
  • 4
    @MarkDickinson The modulus is a good start, but I would then mix it up some more, especially mixing some of the high bits into the low. It's not uncommon to see sequences of integers divisible by powers of 2. It's also not uncommon to see hash tables with capacities that are powers of 2. In Java, for example, if you have a sequence of integers that are divisible by 16, and you use them as keys in a HashMap, you will only use 1/16th of the buckets (at least in the version of the source I'm looking at)! I think hashes ought to be at least a little bit random-looking to avoid these problems. – Matt Timmermans Jun 03 '16 at 22:01
  • Yes, bit-mixing style hashes are far superior to the math inspired ones. Bit-mixing instructions are so cheap that you can have many at the same cost. Also, real world data seems to not have patterns that do *not* work well with bit mixing. But there are patterns that are horrible for modulus. – usr Jun 04 '16 at 11:56
  • 10
    @usr: Sure, but a bit-mixing hash is infeasible here: the requirement that the hash work for `int`, `float`, `Decimal` and `Fraction` objects and that `x == y` implies `hash(x) == hash(y)` even when `x` and `y` have different types imposes some rather severe constraints. If it were just a matter of writing a hash function for integers, without worrying about the other types, it would be an entirely different matter. – Mark Dickinson Jun 04 '16 at 16:36
76

Based on the documentation in CPython's pyhash.c file:

For numeric types, the hash of a number x is based on the reduction of x modulo the prime P = 2**_PyHASH_BITS - 1. It's designed so that hash(x) == hash(y) whenever x and y are numerically equal, even if x and y have different types.
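
The cross-type guarantee quoted above can be illustrated as follows. This is a sketch assuming CPython 3.8 or later (for `pow(n, -1, M)`); the use of a modular inverse for fractions is how the CPython documentation describes the reduction of a rational m/n, and the values chosen here are only examples.

```python
import sys
from decimal import Decimal
from fractions import Fraction

M = sys.hash_info.modulus

# Numerically equal values hash equally, whatever their type.
assert hash(3) == hash(3.0) == hash(Fraction(3, 1)) == hash(Decimal(3))
assert hash(0.5) == hash(Fraction(1, 2)) == hash(Decimal("0.5"))

# For a positive fraction m/n, the reduction uses the modular inverse of n.
# pow(n, -1, M) needs Python 3.8 or later.
m, n = 1, 2
assert hash(Fraction(m, n)) == (m * pow(n, -1, M)) % M
```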

So for a 64/32 bit machine, the modulus would be 2**_PyHASH_BITS - 1, but what is _PyHASH_BITS?

You can find it in the pyhash.h header file, where it is defined as 61 for a 64-bit machine (you can read more explanation in the pyconfig.h file).

#if SIZEOF_VOID_P >= 8
#  define _PyHASH_BITS 61
#else
#  define _PyHASH_BITS 31
#endif
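
The same platform test can be mirrored from Python. This is a rough sketch assuming CPython 3; it uses `struct.calcsize("P")` as a stand-in for `SIZEOF_VOID_P`, and the variable name `hash_bits` is chosen for illustration.

```python
import struct
import sys

# struct.calcsize("P") is the size of a C pointer, i.e. SIZEOF_VOID_P.
hash_bits = 61 if struct.calcsize("P") >= 8 else 31

# The modulus reported by the interpreter should be 2**hash_bits - 1.
assert sys.hash_info.modulus == 2 ** hash_bits - 1
print(hash_bits, sys.hash_info.modulus)
```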

So first of all, it depends on your platform. For example, on my 64-bit Linux platform the reduction is modulo 2**61 - 1, which is 2305843009213693951:

>>> 2**61 - 1
2305843009213693951

You can also use math.frexp to get the mantissa and exponent of sys.maxint (Python 2 only), which on a 64-bit machine shows that the maximum int is about 2**63, that is 0.5 * 2**64:

>>> import math
>>> import sys
>>> math.frexp(sys.maxint)
(0.5, 64)

And you can see the difference with a simple test (on Python 2, where an int up to sys.maxint hashes to itself):

>>> hash(2**62) == 2**62
True
>>> hash(2**63) == 2**63
False

Read the complete documentation of Python's hashing algorithm at https://github.com/python/cpython/blob/master/Python/pyhash.c#L34

As mentioned in the comments, you can use sys.hash_info (in Python 3.x), which gives you a struct sequence of the parameters used for computing hashes.

>>> sys.hash_info
sys.hash_info(width=64, modulus=2305843009213693951, inf=314159, nan=0, imag=1000003, algorithm='siphash24', hash_bits=64, seed_bits=128, cutoff=0)
>>> 

Alongside the modulus that I've described in the preceding lines, you can also get the inf value as follows:

>>> hash(float('inf'))
314159
>>> sys.hash_info.inf
314159
Mazdak
  • 3
    It would be nice to mention `sys.hash_info`, for completeness. – Mark Dickinson Jun 03 '16 at 17:00
  • Is this standard or specific to CPython? The official doc. says *"hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t."*, no mention of a prime, so I'm wondering... I know that `int` does not have custom `__hash__` but this behavior can also be seen when overriding `__hash__` and returning an integer larger than `sys.hash_info.modulus`. – Holt May 11 '22 at 15:09
9

The hash function returns a plain int, which means the returned value lies between -sys.maxint - 1 and sys.maxint. So if you pass sys.maxint + x to it, the result would be -sys.maxint + (x - 2).

hash(sys.maxint + 1) == sys.maxint + 1 # False
hash(sys.maxint + 1) == - sys.maxint -1 # True
hash(sys.maxint + sys.maxint) == -sys.maxint + sys.maxint - 2 # True

Meanwhile, 2**200 is n times greater than sys.maxint. My guess is that the hash would wrap around the range -sys.maxint..+sys.maxint n times until it lands on a plain integer in that range, as in the code snippets above.

So generally, for any n <= sys.maxint:

hash(sys.maxint*n) == -sys.maxint*(n%2) +  2*(n%2)*sys.maxint - n/2 - (n + 1)%2 ## True

Note: this is true for python 2.

Andriy Ivaneyko
  • 8
    This may be true for Python 2, but definitely not for Python 3 (which doesn't have `sys.maxint`, and which uses a different hash function). – interjay Jun 03 '16 at 12:26
0

The implementation for the int type in cpython can be found here.

It just returns the value, except for -1, for which it returns -2:

static long
int_hash(PyIntObject *v)
{
    /* XXX If this is changed, you also need to change the way
       Python's long, float and complex types are hashed. */
    long x = v -> ob_ival;
    if (x == -1)
        x = -2;
    return x;
}
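
The C snippet above uses `PyIntObject`, which is the Python 2 `int`. In Python 3 the integer hash reduces modulo the prime instead, but the -1 special case is still visible. The following pure-Python model is only an illustrative sketch, assuming CPython 3; `model_int_hash` is a hypothetical name, not an interpreter API.

```python
import sys

M = sys.hash_info.modulus

def model_int_hash(n):
    """Pure-Python model of CPython 3's int hash (illustrative only)."""
    h = abs(n) % M  # reduce the magnitude modulo the prime
    if n < 0:
        h = -h  # the sign is carried over to the result
    return -2 if h == -1 else h  # -1 is reserved as an error marker

# Compare the model with the built-in hash on a few signed values.
for n in (-1, -2, -3, 0, 7, -(2**100), 2**100, -M, -(M + 1)):
    assert hash(n) == model_int_hash(n)
```

One consequence is that -1 and -2 collide: both hash to -2, although they remain distinct dictionary keys because equality is still checked.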
Uyghur Lives Matter
Jieter