
The following Python snippet behaves differently depending on whether the code is run on 32bit or 64bit architectures:

PYTHONHASHSEED=0 python3 -c 'print(hash("a"))'

On 32bit architectures it prints -845962679 while on 64bit architectures it prints -7583489610679606711.

This in turn means that, even with PYTHONHASHSEED=0, anything derived from hash values (the order of dictionary keys, for example) depends on the architecture and is only deterministic within 32bit or within 64bit architectures, respectively.
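The difference can also be made visible on a single machine by inspecting the hash width at runtime. A minimal sketch, assuming CPython 3.2 or later, where `sys.hash_info` is available:

```python
import sys

# Number of bits used for hash values: 32 on 32bit builds, 64 on 64bit builds.
width = sys.hash_info.width
print("hash width:", width)

# hash() always returns a signed integer that fits into that many bits,
# so the value for the same string necessarily differs between the two widths.
h = hash("a")
in_range = -(2 ** (width - 1)) <= h < 2 ** (width - 1)
print("hash('a') =", h, "fits in width:", in_range)
```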

This problem can be worked around by either making sure the output is always sorted or by monkeypatching the hash function like this:

import builtins
oldhash = builtins.hash
builtins.hash = lambda x: oldhash(x) & 0xFFFFFFFF
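The first workaround, sorting before producing output, could look roughly like the sketch below. `stable_digest` is a hypothetical helper name used only for illustration, and sha256 is an arbitrary choice of checksum:

```python
import hashlib

def stable_digest(mapping):
    """Return a hex digest of a dict that does not depend on its iteration order."""
    digest = hashlib.sha256()
    # Sorting the keys removes any dependence on hash values or insertion order.
    for key in sorted(mapping):
        digest.update(f"{key}={mapping[key]}\n".encode("utf-8"))
    return digest.hexdigest()

a = {"n48": 1, "n1616": 2}
b = {"n1616": 2, "n48": 1}
print(stable_digest(a) == stable_digest(b))  # True
```

This only helps where I control the code that writes the output, which is exactly the limitation described next.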

Unfortunately, neither of these workarounds works when one uses an external library. In my case I want to use the networkx module. While the networkx maintainers might get to solving this issue at some point (either by providing sorted output or by patching their hash function), they might just as well decide not to, or only fix the problem in the far future. Plus, networkx is not the only module producing output that depends on the hash function, and I'd like a solution that fixes them all right now, without having to wait for external projects or carry local patches.

So my question boils down to: is it possible to modify the Python hash function as it is used by all modules that I import?
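A quick check suggests that replacing the builtin name alone does not reach the containers: in CPython, dict and set hash their keys in C and never look up `builtins.hash`. The following sketch counts how often a patched `hash` is actually called; `patched_hash` and the counter variables are names made up for this illustration:

```python
import builtins

calls = []
original_hash = builtins.hash

def patched_hash(obj):
    # Record every call that goes through the builtin name.
    calls.append(obj)
    return original_hash(obj) & 0xFFFFFFFF

builtins.hash = patched_hash
try:
    d = {"a": 1, "b": 2}   # dict construction hashes the keys internally
    s = {"a", "b"}         # so does set construction
    _ = d["a"]             # and so does a lookup
    calls_from_containers = len(calls)
    hash("a")              # an explicit call does use the patched function
    calls_after_explicit = len(calls)
finally:
    builtins.hash = original_hash  # always restore the real builtin

print(calls_from_containers, calls_after_explicit)  # 0 1
```

So even a patch that every imported module sees would only affect code calling `hash()` explicitly, not the key order of dictionaries and sets.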

Or is there another solution that lets me use modules that produce output which depends on the hash function in a way such that it is deterministic between 32bit and 64bit architectures?

EDIT: as suggested, I'm making this question more specific. I'm choosing networkx as an example to demonstrate the problem, but I'd really like a general solution that I can apply to other modules I import as well.

Consider the following graph test.xml in graphml:

<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <graph edgedefault="directed">
    <node id="n1616"/> <node id="n48"/> <node id="n3637"/> <node id="n2842"/>
    <node id="n2530"/> <node id="n2396"/> <node id="n6453"/> <node id="n278"/>
    <node id="n1209"/> <node id="n92"/> <node id="n793"/> <node id="n3631"/>
    <node id="n341"/> <node id="n3151"/> <node id="n1717"/> <node id="n890"/>
    <node id="n11399"/> <node id="n203"/> <node id="n1928"/> <node id="n555"/>
    <node id="n156"/> <node id="n553"/> <node id="n2524"/> <node id="n3396"/>
    <node id="n1741"/> <node id="n4117"/> <node id="n959"/> <node id="n1667"/>
    <node id="n6489"/> <node id="n4973"/> <node id="n2247"/> <node id="n927"/>
    <node id="n1211"/> <node id="n5467"/> <node id="n450"/> <node id="n1727"/>
    <node id="n3531"/> <node id="n6357"/> <node id="n317"/> <node id="n37"/>
    <node id="n14349"/> <node id="n1530"/> <node id="n12429"/>
    <node id="n249"/> <node id="n348"/> <node id="n3285"/> <node id="n2518"/>
    <node id="n406"/> <node id="n2034"/> <node id="n2855"/> <node id="n6"/>
    <node id="n4742"/> <node id="n125"/> <node id="n281"/> <node id="n44"/>
    <node id="n924"/> <node id="n926"/> <node id="n251"/> <node id="n5455"/>
    <node id="n666"/> <node id="n3112"/> <node id="n2870"/> <node id="n6452"/>
    <node id="n3156"/> <node id="n2299"/> <node id="n416"/> <node id="n4556"/>
    <node id="n1832"/> <node id="n89"/> <node id="n2342"/> <node id="n1327"/>
    <node id="n1333"/> <node id="n542"/> <node id="n674"/> <node id="n47"/>
    <node id="n1174"/> <node id="n102"/> <node id="n1570"/> <node id="n1362"/>
    <node id="n9721"/> <node id="n789"/> <node id="n270"/> <node id="n1524"/>
    <node id="n4616"/> <node id="n6093"/> <node id="n2386"/>
  </graph>
</graphml>

Then the following produces different output depending on the architecture:

PYTHONHASHSEED=0 python3 -c "import networkx, sys; g = networkx.read_graphml('test.xml'); networkx.write_graphml(g, sys.stdout.buffer)" | md5sum

The output should be the same, though. I'm not able to come up with a more minimal example because if I remove a node from the XML input above, the output seems to become the same on both architectures. I don't know why this happens.

It would help if networkx did not rely on dictionary order for its output but instead allowed the output to be sorted. When I brought this up and filed it as https://github.com/networkx/networkx/issues/1181, the bug was closed after some discussion without my patch being accepted. In the meantime I fixed part of the problem with PYTHONHASHSEED=0, but that does not fix the different output between 32bit and 64bit architectures.
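One architecture-independent alternative to comparing the raw bytes would be to normalize the written GraphML before computing the checksum. The sketch below uses only the standard library. `canonical_graphml_digest` is a hypothetical helper, not part of networkx, and it assumes that sorting the children of each `<graph>` element by tag and identifying attributes is an acceptable notion of "same output" for a test suite:

```python
import hashlib
import xml.etree.ElementTree as ET

NS = "{http://graphml.graphdrawing.org/xmlns}"

def canonical_graphml_digest(xml_bytes):
    """Hash GraphML after sorting each graph's children, ignoring their order."""
    root = ET.fromstring(xml_bytes)
    for graph in root.iter(NS + "graph"):
        children = list(graph)
        # Sort by tag first, then by the attributes that identify nodes and edges.
        children.sort(key=lambda e: (e.tag, e.get("id", ""),
                                     e.get("source", ""), e.get("target", "")))
        # Drop inter-element whitespace so that layout does not affect the digest.
        graph.text = None
        for child in children:
            child.tail = None
        graph[:] = children
    return hashlib.sha256(ET.tostring(root)).hexdigest()

HEAD = b'<graphml xmlns="http://graphml.graphdrawing.org/xmlns"><graph edgedefault="directed">'
TAIL = b'</graph></graphml>'
xml_a = HEAD + b'<node id="n48"/> <node id="n1616"/>' + TAIL
xml_b = HEAD + b'\n  <node id="n1616"/>\n  <node id="n48"/>\n' + TAIL
print(canonical_graphml_digest(xml_a) == canonical_graphml_digest(xml_b))  # True
```

This sidesteps the hash function entirely, at the cost of one such normalizer per output format.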

josch
  • The problem this causes is that my test suite produces different output on different architectures, so I cannot check whether the generated output is the expected one by just comparing a sha or md5sum. – josch Oct 12 '14 at 13:36
  • I suspect the output you are referring to is the output of the `__repr__` function for an object, which should be treated as a debugging aid, not a productive part of the API. Test the state of the object, not some arbitrary string representation of it. – chepner Oct 12 '14 at 13:39
  • @chepner sorry I do not understand - what exactly do you think has to be improved about my question? – josch Oct 12 '14 at 13:41
  • I can't currently find the respective parts in the documentation, but I'm pretty sure Python's builtin `__hash__` is meant for internal use, so it's not designed to be platform-independent. Use [`hashlib`](http://docs.python.org/library/hashlib.html?highlight=hash#module-hashlib) instead if you need a stable hash of fixed length. – Lukas Graf Oct 12 '14 at 13:42
  • Please provide an example of the code you feel should provide a fixed, predictable output. – chepner Oct 12 '14 at 13:42
  • @josch yep, I just realized that, those comments trickled in while I was Googling, nevermind my comment then. – Lukas Graf Oct 12 '14 at 13:44
  • Wait - dictionary key order? You shouldn't make any assertions on the order of a dictionary anyway, because its order is arbitrary. If order is relevant then a dictionary is the wrong data structure. – Lukas Graf Oct 12 '14 at 13:48
  • @LukasGraf I know and I'm not making any assumptions about the order - the modules I use do. I want to find a way to force the modules to use a specific hash such that it doesn't matter that they make the output order depending on dictionary order. – josch Oct 12 '14 at 13:52
  • I see - but in my opinion that's still the wrong approach. I second @chepner's advice: *"Test the state of the object, not some arbitrary string representation of it."*. Can't you for example make assertions on the actual graph (not its string representation) using [`assertDictContainsSubset`](https://docs.python.org/2/library/unittest.html#unittest.TestCase.assertDictContainsSubset) (or write your own assertion helper method)? – Lukas Graf Oct 12 '14 at 13:55
  • Yes, I see how that would result in additional work on your end, but given that graphs don't have an inherent order, I can also understand that the `networkx` guys are somewhat reluctant to accept a change that enforces deterministic output. – Lukas Graf Oct 12 '14 at 14:10
  • @LukasGraf which gets me back to my original question: is there a way to patch the modules I use (like networkx) to use a known-good hash function instead of the built-in one? – josch Oct 12 '14 at 14:12
  • Also: Just monkey-patching the `hash()` function is unlikely to completely solve your problem. If the order of some operations changes in your code (but functionally it stays the same), the order in which dictionary insert collisions happen might change, and so will the key order, even though the code produces an equivalent dictionary. (See [The Mighty Dictionary](https://www.youtube.com/watch?v=C4Kc8xzcA68) if you haven't yet). – Lukas Graf Oct 12 '14 at 14:15
  • Example that intentionally forces hash-collisions to produce different key orders for two equivalent dictionaries: http://ideone.com/ahbd9a – Lukas Graf Oct 12 '14 at 14:23
  • @LukasGraf I know that the insertion order matters but given the same input files, the insertion order should be the same – josch Oct 12 '14 at 14:32
