64

I was messing around with making a command line parser and was wondering what kind of hash algorithm python dict's use?

The way I have it set up, I have a pattern match algorithm which matches tokenized input sequences with a dictionary key. Some of the keys are relatively long (length 5 or 6 tuples of 6-7 character strings). I was wondering if there was a point at which long dictionary keys significantly reduce the efficiency of key retrieval.

Joel Cornett
  • 24,192
  • 9
  • 66
  • 88
  • 1
    Take a look at [Objects/dictnotes.txt](http://hg.python.org/cpython/file/2.7/Objects/dictnotes.txt) – jfs Jan 25 '12 at 05:13
  • 1
    Take a look at [this question](http://stackoverflow.com/questions/2070276/where-can-i-find-source-or-algorithm-of-pythons-hash-function). It has a link to [this page](http://effbot.org/zone/python-hash.htm) which describes how python hashes some different types and it might well be helpful to you. – srgerg Jan 25 '12 at 04:58
  • 2
    A lot of the modern hashing code was modified for PEP 456 which is documented here: https://www.python.org/dev/peps/pep-0456/. The answer is: multiple hash functions can be used depending on compilation arguments and string size. – Caleb Fenton Jan 23 '19 at 22:22

1 Answers1

70

The hash that it uses depends on the object being used as a key -- each class can define its own __hash__() method, and the value that it returns for a particular instance is what is used for the dictionary.

Python itself provides the hash implementation for str and tuple types. A quick look at the source should reveal the exact algorithm for those.

The hash of a tuple is based on the hashes of its contents. The algorithm is essentially this (simplified slightly):

def hash(tuple):
    mult = 1000003
    x = 0x345678
    for index, item in enumerate(tuple):
        x = ((x ^ hash(item)) * mult) & (1<<32)
        mult += (82520 + (len(tuple)-index)*2)
    return x + 97531

For strings, the interpreter also iterates over every character, combining them with this (again, slightly simplified) algorithm:

def hash(string):
    x = string[0] << 7
    for chr in string[1:]:
        x = ((1000003 * x) ^ chr) & (1<<32)
    return x

A bigger issue to worry about is avoiding hash collisions. Colliding hash keys will cause a linear search as the dictionary tries to find a place to store the new object (this is now being recognized as a security issue, and tha behavior may be changing in upcoming python versions)

Ian Clelland
  • 43,011
  • 8
  • 86
  • 87
  • 2
    Oh ok. For some reason I assumed python just used a generic bytecode hash algorithm for all data types. As far as colliding hash keys goes, I don't think that's going to be an issue, as the number of keys I will have is (relatively) small -- probably in the thousands. Pardon my ingnorance, but how do colliding hashes and linear searches become a security issue? – Joel Cornett Jan 25 '12 at 05:06
  • 7
    @Joel Cornett: This is a security issue because hash tables use buckets to store keys, and keys with the same hash code will be hashed to the same bucket, forcing the hash table to do a linear search each time it searches for a key, which can be very inefficient (and can even cause denial of service) if the number of keys is large. Denial-of-service attacks can result if a program encounters a hash table with distinct keys that hash to the same hash code. – Peter O. Jan 25 '12 at 05:17
  • 7
    If an attacker can control the keys that are used in your dictionary, then they might be able to insert hundreds or thousands of colliding keys, making insert operations very very slow. In some cases, this could cause a machine to become unresponsive, or a database to become unusable -- a Dos attack – Ian Clelland Jan 25 '12 at 05:24
  • 1
    Are there documented cases of such attacks being launched? It seems like it would be prohibitively difficult against all but the simplest system, and even harder without white-box knowledge. – Karl Knechtel Jan 25 '12 at 08:16
  • 2
    The first attack scenario I heard being discussed was against web services -- if the POST variables end up in a dictionary keyed on their names, then an attacker has complete control over the keys, and can slow down a server arbitrarily, with enough variables. I don't know if there are such attacks in the wild yet, though. – Ian Clelland Jan 25 '12 at 19:45
  • The `string_hash` function raises a TypeError. In Python, the left shift operator is looks for an integer, not a string. How do you refactor this to run in Python? – pylang Apr 20 '16 at 06:37
  • 1
    FYI the reported bug is https://bugs.python.org/issue13703 (Hash collision security issue), and the email thread pointing this issue is here https://mail.python.org/pipermail/python-dev/2011-December/115116.html, which cites a presentation at Chaos Communication Congress. – lseki Jul 16 '18 at 14:36
  • 1
    The described hash function doesn't actually make sense. Assuming you use ord to convert each character into an ascii version, it looks fine right up to the point that you & (1<<32). Since you are looking for all bits that are set in the result and are set in (0b100000000000000000000000000000000), you will always come out with 0. – Robert Rapplean Nov 27 '18 at 19:01
  • 1
    Do you know where the numbers come from? For example, why is `97531` added? Why not 97530 or 97529? Why not `13933`? – Martin Thoma Jan 16 '21 at 13:17