
I have an application running in Python 3.9.4 where I store class objects in sets (along with many other kinds of objects). I'm getting non-deterministic behavior even when PYTHONHASHSEED=0 because class objects get non-deterministic hash codes. I assume that's because class objects' hash codes come from their addresses in memory.

For example, here are two runs of a little test program, where Before and Equation are classes:

print(hash(Before), hash(Equation), hash(int))
304555224 304593057 271715397

print(hash(Before), hash(Equation), hash(int))
326601328 293027788 273337413

How can I get Python to generate deterministic hash values for class objects?

Is there a metaclass or something that I could monkey-patch so that all class objects, even int, get a hash function that I specify?

Ben Kovitz

1 Answer


The hash of a class is deterministic within the same process. Yes, in CPython it is based on the memory address, but then you can't simply "move" a class object to another memory address using Python code.

If you run the classes through some serialization/de-serialization step, the de-serialized objects will ordinarily be new objects, distinct from the original ones, and will therefore hash differently.

For the record: I could not reproduce the behavior you describe in the question: within the same process, the hashes for the class objects stay the same.

If you are calculating the hashes in different processes, though, they will differ. So, although you don't mention multiprocessing in the question, I assume that is your working case.
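
For instance, here is a quick sketch that launches two fresh interpreters and prints the hash of a class defined in each one; the numbers will normally differ, since they come from the class object's memory address (the class name is just illustrative):

import subprocess
import sys

# run the same tiny script in two separate interpreter processes
code = "class Before: pass\nprint(hash(Before))"
for _ in range(2):
    subprocess.run([sys.executable, "-c", code])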

Then, indeed, implementing proper __hash__ and __eq__ methods on the metaclass can give you stable, cross-process hashing, but you can't do that for built-in classes such as int: those are written in native code and can't be changed from the Python side. On the other hand, even though the hash numbers shown for those built-in classes differ, whatever you are using to serialize/deserialize your classes (and that is what Python does to communicate data across processes, even if you don't do any explicit de/serializing) resolves built-in classes back to the very same objects on the receiving side, so the differing hash numbers are harmless for them.
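
As a small check, using the standard pickle module: classes are pickled "by reference" (by module and qualified name), so unpickling int gives back the exact same class object:

import pickle

restored = pickle.loads(pickle.dumps(int))
print(restored is int)               # True
print(hash(restored) == hash(int))   # True (same object, same process)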

This brings us to the next point: while it is straightforward to add __eq__ and __hash__ methods to your classes through a metaclass, it would be better to ensure that de-serializing always yields the very same object (with the same id). Hash stability, as you put it, could possibly ensure you always get an equal class, but that depends on how you write your code: it is a bit tricky to retrieve the object instance that is already inside a set when you check containment with another instance that merely compares equal to it. The most straightforward way is to build an identity dictionary out of the set and then use the value:

# map each element of the set to itself, so the stored instance can be looked up
my_registry_dict = {element: element for element in my_registry_set}
# given an incoming object that compares equal, fetch the one already registered
my_class = my_registry_dict[incoming_class]
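
A small, self-contained illustration of that trick (the Thing class here is just a stand-in for whatever hashable objects you keep in the set):

class Thing:
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return isinstance(other, Thing) and self.name == other.name
    def __hash__(self):
        return hash(self.name)

my_registry_set = {Thing("a"), Thing("b")}
my_registry_dict = {element: element for element in my_registry_set}

incoming = Thing("a")                 # equal to, but not the same object as, the stored one
stored = my_registry_dict[incoming]   # retrieves the instance already in the set
print(stored is incoming)             # False: we got the originally stored object back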

With this in mind, we can have a custom metaclass that not only adds __eq__ and __hash__ (you have to pick which attributes of the classes you want to compare for equality; __qualname__ is a simple and functional choice) but also customizes __new__ so that de-serializing the same class a second time always reuses the first class object defined in the current process (i.e. it preserves the "singleton" behavior that Python classes enjoy outside of corner cases like yours):

class Meta(type):
    registry = {}
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        if cls not in mcls.registry:
            mcls.registry[cls] = cls
        else:
            # reuse the previously created class
            cls = mcls.registry[cls]
        return cls

    def __hash__(cls):
        # when working with metaclasses, using the name `cls` instead of `self`
        # helps remind us that we are dealing with instances that are
        # actually classes.
        return hash(cls.__qualname__)

    def __eq__(cls, other):
        # guard against comparisons with non-class objects
        if not isinstance(other, type):
            return NotImplemented
        return cls.__qualname__ == other.__qualname__
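
For example, a minimal usage sketch (assuming the class is defined at module level, so __qualname__ is simply the class name):

class Equation(metaclass=Meta):
    pass

# the hash now depends only on the qualified name (stable across runs with a fixed PYTHONHASHSEED)
print(hash(Equation) == hash(Equation.__qualname__))   # True

# creating "the same" class again returns the object already in the registry
Clone = Meta("Equation", (), {})
print(Clone is Equation)                               # True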


jsbueno
  • Thanks for the thoughtful, extensive answer. I need to think about what you've said here to determine if it can solve my problem. I need the program to produce the same results each time I run it, not only on one machine but on all machines. For example, if I put some objects into a set, then when I iterate through the set, I need the objects to come out in the same sequence each time I run the program. – Ben Kovitz Feb 04 '22 at 23:52
  • If you need the same sequence across runs, then "set" is not your data structure. Sets are not designed to be deterministic in Python, and trying to work around that by forcing the hash seed is not the way to go. If you need a deterministic order, pick a criterion for sorting and convert your set to a list with a `sorted` call before any output. It is also possible to create a "sorted set" class that maintains an internal, parallel list or dict and reproduces the insertion order when fetching elements. If you need further help, please state your requirements and post a follow-up question. – jsbueno Feb 05 '22 at 13:32