The following Python snippet behaves differently depending on whether the code is run on 32bit or 64bit architectures:
PYTHONHASHSEED=0 python3 -c 'print(hash("a"))'
On 32bit architectures it prints -845962679 while on 64bit architectures it prints -7583489610679606711.
This in turn means that, even when setting PYTHONHASHSEED=0, the order of dictionary keys, for example, depends on the architecture and is only deterministic within 32bit or 64bit architectures, respectively.
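The width of the hash can be inspected from within Python. The following is a minimal sketch using only `sys.hash_info` from the standard library; the helper name `describe_hash` is made up for illustration:

```python
import sys


def describe_hash(value):
    """Return (hash width in bits, hash of value) for the running interpreter."""
    # sys.hash_info.width is the number of bits used for hash values:
    # 32 on 32bit builds and 64 on 64bit builds.
    width = sys.hash_info.width
    return width, hash(value)


width, h = describe_hash("a")
# The hash always fits into a signed integer of `width` bits, so the
# value range itself already differs between the two architectures.
assert -(2 ** (width - 1)) <= h < 2 ** (width - 1)
print(width, h)
```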
This problem can be worked around by either making sure the output is always sorted or by monkeypatching the hash function like this:
import builtins
oldhash = builtins.hash
# Mask the result of hash() to its low 32 bits.
builtins.hash = lambda x: oldhash(x) & 0xFFFFFFFF
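For code that I control, the first workaround (sorting) is straightforward. A minimal sketch with a few node ids taken from the example further down; the names `nodes` and `dump_sorted` are only illustrative:

```python
def dump_sorted(items):
    """Return the items as a list in sorted order, independent of hash order."""
    return sorted(items)


nodes = {"n1616", "n48", "n3637", "n2842"}
# Iterating over the set directly yields a hash-dependent order;
# sorting first gives the same order on every architecture.
print(dump_sorted(nodes))
```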
Unfortunately, neither of these workarounds works when one uses an external library. In my case I want to use the networkx module. The networkx maintainers might solve this issue at some point (either by providing sorted output or by patching their hash function), but they might just as well decide not to, or only fix it in the far future. Plus, networkx is not the only module producing output that depends on the output of the hash function, and I'd like to find a solution that fixes them all right now, without having to wait for external projects or carry local patches.
So my question boils down to: is it possible to modify the Python hash function as it is used by all modules that I import?
Or is there another solution that lets me use modules that produce output which depends on the hash function in a way such that it is deterministic between 32bit and 64bit architectures?
EDIT: as suggested, I'm making this question more specific. I'm choosing networkx as an example to demonstrate the problem, but I'd really like a general solution that I can apply to other modules I import as well.
Consider the following graph test.xml in GraphML:
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<graph edgedefault="directed">
<node id="n1616"/> <node id="n48"/> <node id="n3637"/> <node id="n2842"/>
<node id="n2530"/> <node id="n2396"/> <node id="n6453"/> <node id="n278"/>
<node id="n1209"/> <node id="n92"/> <node id="n793"/> <node id="n3631"/>
<node id="n341"/> <node id="n3151"/> <node id="n1717"/> <node id="n890"/>
<node id="n11399"/> <node id="n203"/> <node id="n1928"/> <node id="n555"/>
<node id="n156"/> <node id="n553"/> <node id="n2524"/> <node id="n3396"/>
<node id="n1741"/> <node id="n4117"/> <node id="n959"/> <node id="n1667"/>
<node id="n6489"/> <node id="n4973"/> <node id="n2247"/> <node id="n927"/>
<node id="n1211"/> <node id="n5467"/> <node id="n450"/> <node id="n1727"/>
<node id="n3531"/> <node id="n6357"/> <node id="n317"/> <node id="n37"/>
<node id="n14349"/> <node id="n1530"/> <node id="n12429"/>
<node id="n249"/> <node id="n348"/> <node id="n3285"/> <node id="n2518"/>
<node id="n406"/> <node id="n2034"/> <node id="n2855"/> <node id="n6"/>
<node id="n4742"/> <node id="n125"/> <node id="n281"/> <node id="n44"/>
<node id="n924"/> <node id="n926"/> <node id="n251"/> <node id="n5455"/>
<node id="n666"/> <node id="n3112"/> <node id="n2870"/> <node id="n6452"/>
<node id="n3156"/> <node id="n2299"/> <node id="n416"/> <node id="n4556"/>
<node id="n1832"/> <node id="n89"/> <node id="n2342"/> <node id="n1327"/>
<node id="n1333"/> <node id="n542"/> <node id="n674"/> <node id="n47"/>
<node id="n1174"/> <node id="n102"/> <node id="n1570"/> <node id="n1362"/>
<node id="n9721"/> <node id="n789"/> <node id="n270"/> <node id="n1524"/>
<node id="n4616"/> <node id="n6093"/> <node id="n2386"/>
</graph>
</graphml>
Then the following produces different output depending on the architecture:
PYTHONHASHSEED=0 python3 -c "import networkx, sys; g = networkx.read_graphml('test.xml'); networkx.write_graphml(g, sys.stdout.buffer)" | md5sum
The output should be the same on both architectures. I'm not able to come up with a more minimal example, because if I remove a node from the XML input above, the output becomes the same. I don't know why this happens.
It would help if networkx did not rely on dictionary order for its output but instead allowed the output to be sorted. When I brought this up and filed it as https://github.com/networkx/networkx/issues/1181, the bug was closed after some discussion without my patch being accepted. In the meantime I fixed part of this problem with PYTHONHASHSEED=0, but that does not fix the problem of different output between 32bit and 64bit architectures.
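To illustrate what "sorted output" would mean here, the following is a rough sketch that post-processes GraphML with the standard library instead of patching networkx. The function name `sort_graphml` and the choice of sort key are my own assumptions, and it only reorders the direct children of each graph element:

```python
import xml.etree.ElementTree as ET

NS = "http://graphml.graphdrawing.org/xmlns"


def sort_graphml(data):
    """Return GraphML bytes with the children of each <graph> sorted."""
    # Serialize the GraphML namespace as the default namespace again.
    ET.register_namespace("", NS)
    root = ET.fromstring(data)
    # Collect the graph elements first so the tree is not modified
    # while it is being traversed.
    for graph in list(root.iter("{%s}graph" % NS)):
        # Sort by tag, then by the attributes that identify nodes and edges.
        graph[:] = sorted(
            graph,
            key=lambda e: (
                e.tag,
                e.get("id", ""),
                e.get("source", ""),
                e.get("target", ""),
            ),
        )
    return ET.tostring(root)


sample = (
    b'<graphml xmlns="http://graphml.graphdrawing.org/xmlns">'
    b'<graph edgedefault="directed">'
    b'<node id="n48"/><node id="n1616"/>'
    b'</graph></graphml>'
)
print(sort_graphml(sample).decode())
```

Piping the output of the command above through such a filter before md5sum should remove ordering differences for nodes and edges, but this is a per-format fix and not the general solution I am after.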