Python hash: identity vs. equivalence

Question

I have a class whose instances are to be distinguished by an identity that is different from the data values they carry. In my code, I intend to use == to mean two instances are equivalent regarding their data, and is to mean two variables refer to the same instance, that is, that they are identical. This is all quite normal, to my understanding.

Furthermore I want to use instances (equivalent or not) in sets, and as keys in dicts. That requires the __hash__ function to be defined for the class.

But in this regard I don't understand the documented requirement of __hash__:

The only required property is that objects which compare equal have the same hash value.

Does that mean that two distinct but equivalent objects cannot be used as different keys in a dict, or appear individually in a set? In the code below I break that requirement by overriding __eq__ and __hash__ to reflect my intended use. It works as expected in Python 2.7 and 3.7.

What are the negative consequences of breaking the requirement of __hash__ as I've done here?

Is there a better way to accomplish my goal?

class A( object ):
        def __init__( self, v1, v2 ):
                self.v = ( v1, v2 )

        def __eq__( self, b ):
                return self.v[0] == b.v[0] and self.v[1] == b.v[1]

        def __hash__( self ):
                return id( self )

        def __str__( self ):
                return str( self.v )

p = A( 1, 0 )
q = A( 1, 0 )

print( str( p ), str( q ) )
print( "identical?", p is q )
print( "equivalent?", p == q )
print( "hashes", hash(p), hash(q) )

s = set( [p, q] )
print( "set length", len( s ) )
print( "both in set?", p in s, q in s )

d = { p:3, q:4 }
print( "dict length", len( d ) )
print( "as distinct keys", d[p], d[q] )

Your example *shows* the problem: although p and q are considered equal, they are distinct in sets and dictionaries. `d[A(1, 0)]` is *probably* a key error even though there are (two!) keys that match the value. Note that neither `__eq__` nor `__hash__` is used to determine identity; you can't override the result of `id`. — jonrsharpe, Oct 11 '19 at 17:59
If `__eq__()` would return True, then `__hash__()` *must* return the same value for both objects - otherwise, a dict/set would never get to the point of even calling `__eq__()` on those objects. — jasonharper, Oct 11 '19 at 18:00
You have to ask yourself: "If objects are equal, why do I need them both to appear in a set like distinct elements?" The correct answer will be "I really don't need it". — sanyassh, Oct 11 '19 at 18:22
Hi @jonsharpe, I guess you misunderstood. I do not intend to look up a new instance of A. As in my example, I create an A, put it in a set or dict, then look it up later by *identity*, not by *value*. The code I wrote seems to do just that, and produces the results I expect. — Steve White, Oct 11 '19 at 19:24
@sanyash You deny my use-case. Clearly I only need air to breathe, food to eat, and shelter against the elements. I do not _need_ to key a dict according to identity of keys. But let's say that is my desire. How do I fulfill my desire? — Steve White, Oct 12 '19 at 09:55
You already got what you want with `def __hash__( self ): return id( self )`. I just meant that it doesn't have any practical use. Negative consequences of breaking the requirement of `__hash__`: usage of set and dict becomes useless. Set must contain elements which give different results on `__eq__`. But your code breaks this. — sanyassh, Oct 12 '19 at 10:01
Hi @sanhash. I do not grasp your meaning. In what sense are set and dict useless? I can add objects of A to them, I can look them up in the dict and retrieve them. I pop them off of set. These are the sorts of things I want to do. What use do you have in mind, that I have not listed? — Steve White, Oct 12 '19 at 14:12
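
The lookup failure mentioned in the first comment can be reproduced with a short sketch. It reuses the class A from the question (value-based `__eq__`, `id`-based `__hash__`); the variable name `probe` is only illustrative. A third object that compares equal to both keys is found by a list, which relies on `==`, but not by the dict, which consults the hash first.

```python
class A(object):
    """The class from the question: value-based __eq__, identity-based __hash__."""
    def __init__(self, v1, v2):
        self.v = (v1, v2)

    def __eq__(self, b):
        return self.v[0] == b.v[0] and self.v[1] == b.v[1]

    def __hash__(self):
        return id(self)


p = A(1, 0)
q = A(1, 0)
d = {p: 3, q: 4}

probe = A(1, 0)  # equal to both keys, but a third, distinct object

found_in_list = probe in [p, q]  # list membership falls back on ==
found_in_dict = probe in d       # dict membership looks at the hash first

print("equal to p and q?", probe == p, probe == q)
print("found in list?", found_in_list)
print("found in dict?", found_in_dict)
```

Whether this matters depends on the use case: if keys are only ever looked up through the original object, as the question intends, the mismatch stays invisible.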

jsbueno · Answer 1 · 2019-10-12T14:06:29.153

The only required property is that objects which compare equal have the same hash value.

The "compare equal" in the spec text means the result of their __eq__ methods - there is no requirement that they are the same object.

The __hash__, though, has to be based on the values that are used in __eq__, not on the object's "id"; that part is incorrect in your code. For the two methods to be consistent, they would have to look like this:

      ...
      def __eq__( self, b ):
           return self.v[0] == b.v[0] and self.v[1] == b.v[1]

      def __hash__( self ):
           return hash((self.v[0], self.v[1]))

Does that mean that two distinct but equivalent objects cannot be used as different keys in a dict, or appear individually in a set?

Yes. This is what the spec means.
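
A minimal sketch of that consequence, assuming a class whose `__hash__` is derived from the same values as `__eq__` (the name `ValueKeyed` is hypothetical): two distinct but equivalent instances collapse into a single set element and a single dict key.

```python
class ValueKeyed(object):
    """Hypothetical class: __eq__ and __hash__ both derive from the data."""
    def __init__(self, v1, v2):
        self.v = (v1, v2)

    def __eq__(self, other):
        return self.v == other.v

    def __hash__(self):
        return hash(self.v)


p = ValueKeyed(1, 0)
q = ValueKeyed(1, 0)

s = {p, q}   # q is treated as a duplicate of p
d = {p: 3}
d[q] = 4     # overwrites the value stored under p

print("identical?", p is q)
print("set length", len(s))
print("dict length", len(d), "value under p", d[p])
```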

The workaround for that is to leave the default __eq__ implementation for your class to conform to the rules, and implement an alternate form of comparison that you will have to use in your code.

The most straightforward way is to leave the default implementation of __eq__ as it is, which compares by identity, and add an ordinary method that you use for data comparison (the idiom that languages without operator overloading have to use anyway):

class A( object ):
    ...
    def equals( self, b ):
       return self.v[0] == b.v[0] and self.v[1] == b.v[1]

p = A( 1, 0 )
q = A( 1, 0 )

print( str( p ), str( q ) )
print( "identical?", p is q )
print( "equivalent?", p.equals(q) )

There are ways to improve a little on this, but the baseline is: __eq__ has to conform to the rules, which here means doing an identity comparison.
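
As a rough check of this workaround, the sketch below fills in the elided parts of the class with the constructor from the question (an assumption) and uses the instances as dict keys. Because `__eq__` and `__hash__` are both left at their identity-based defaults, the two equivalent objects stay distinct keys, while `equals` still reports data equivalence.

```python
class A(object):
    def __init__(self, v1, v2):
        self.v = (v1, v2)

    # __eq__ and __hash__ are deliberately not overridden:
    # both keep their default, identity-based behaviour.
    def equals(self, b):
        return self.v[0] == b.v[0] and self.v[1] == b.v[1]


p = A(1, 0)
q = A(1, 0)
d = {p: 3, q: 4}

print("equivalent?", p.equals(q))
print("== is identity here:", p == q)
print("dict length", len(d))
print("as distinct keys", d[p], d[q])
```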

One way is to have an internal associated object that works as a "namespace" which you can use for comparison:

class CompareSpace:
    def __init__(self, parent):
        self.parent = parent

    def __eq__( self, other ):
        # Accept either another CompareSpace or a bare A instance.
        other = other.parent if isinstance(other, type(self)) else other
        return self.parent.v[0] == other.v[0] and self.parent.v[1] == other.v[1]


class A:
    def __init__( self, v1, v2 ):
        self.v = ( v1, v2 )
        # Value comparison lives here; A itself keeps the default identity-based __eq__.
        self.comp = CompareSpace(self)

    def __str__( self ):
        return str( self.v )

p = A( 1, 0 )
q = A( 1, 0 )

print( str( p ), str( q ) )
print( "identical?", p is q )
print( "equivalent?", p.comp == q )
print( "hashes", hash(p), hash(q) )

demonstration of brokenness

Here is an example of how this breaks. The class below is deliberately more broken than yours, so that the problem shows up on the first try. But even if the problem occurred only once in 2 million times, your code would still be too broken to use in anything real, even personal code: you would have a dictionary that is not deterministic:


class Broken:
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return id(self) % 256
    def __eq__(self, other):
        return True
    def __repr__(self):
        return self.name


In [23]: objs = [Broken(f"{i:02d}") for i in range(64)]                                        

In [24]: print(objs)                                                                           
[00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]

In [25]: test = {}                                                                             

In [26]: for obj in objs: 
    ...:     if obj not in test: 
    ...:         test[obj] = 0 
    ...:                                                                                       

In [27]: print(test)                                                                           
{00: 0, 01: 0, 02: 0, 11: 0}

# Or with simple, unconditional, insertion:
In [29]: test = {obj: i for i, obj in enumerate(objs)}                                         

In [30]: test                                                                                  
Out[30]: {00: 57, 01: 62, 02: 63, 11: 60}

(I repeat: while your hash values won't clash by themselves, the internal dict code has to reduce the hash number to an index into its hash table, not necessarily by modulo (%). Otherwise every empty dict would need 2 ** 64 empty entries, and that only if all hashes were just 64 bits wide.)
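
The reduction from hash to table index can be pictured with a deliberately simplified model. The function `slot_index` and the table size of 8 are assumptions made for illustration; the real CPython tables also perturb the probe sequence and grow as they fill up. The point is only that distinct hash values can start out at the same slot.

```python
def slot_index(hash_value, table_size=8):
    """Simplified model: keep only the low bits of the hash.

    Assumes table_size is a power of two, so the mask is table_size - 1.
    """
    return hash_value & (table_size - 1)


hashes = [1, 9, 17]  # three distinct hash values
slots = [slot_index(h) for h in hashes]

print("hashes:", hashes)
print("initial slots:", slots)
```

What the container does once two keys meet at the same slot is a separate question, and it is where implementations may differ.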

Evidently I was not clear. The code as I presented does what I intended. Your code does what I do not want. It's as if... Do we agree that equivalence and identity are two different concepts? — Steve White, Oct 11 '19 at 18:30
So the answer is "no": you can't have things that compare equal but have different hashes as members of sets or as dictionary keys. I'd suggest you leave the default `__eq__`, which compares by "id", and use a manual comparison method to compare your data (you will lose the "==" operator but will be able to write an `if obj_a.eq(obj_b):` idiom). — jsbueno, Oct 11 '19 at 19:13
I edited the answer in a way you can be better served. Another option is to use wrappers, like in @MSeifert 's answer, instead of an internal associated object - either way you will need a helper class to be able to use `==` — jsbueno, Oct 11 '19 at 19:41
Hi, jsbueno. The question is not, "is it allowed". And I considered leaving '==' to mean the same thing as 'is', and writing some other function to properly distinguish equivalence from identity -- but it seems so wrong, considering the usual definitions of those operators, and considering the code I wrote seems to do just what I want. Again: what exactly is the negative consequence of code such as I wrote? We have yet to see an example of that. — Steve White, Oct 11 '19 at 20:48
The `__eq__` is used by the set and dict algorithms in case of hash collision. Of course, the hash being the ID, it will never be "equal" between two objects, but in real-life structures the hash is used modulo a reserved size for the data structure. Let's say that a small dictionary has 64 positions: it is very possible you get a collision there. And then Python might treat two distinct objects as the same, since they compare equal, and you get your distinct object overwritten in your data structure. That is: your set or dict will be unreliable — jsbueno, Oct 12 '19 at 02:30
Hi jsbueno, I'm afraid I don't quite see what you're getting at. What "real life situations" are we discussing here? Do you mean something different from my code example? In my code example, I use the object's id in `__hash__`. About the internals of the dict algorithms I know very little. Can you explain how code like what I wrote would result in a "collision"? I have not seen that happen. Better: could you produce an example? — Steve White, Oct 12 '19 at 07:34
Hi, I'm trying to piece together what you and other responders have said, to arrive at an explanation of the problem. (I have not yet read up on the internal workings of dict.) Here is my guess: the algorithms do not internally maintain the information returned by `__hash__`, so the originally unique values given by id result in non-unique internal representation, causing (occasional) "collisions", which are resolved by the heuristic of calling `__eq__`, which will fail if the two objects are equivalent. Is that close? — Steve White, Oct 12 '19 at 14:29

score 0 · Answer 2 · answered Oct 20 '19 at 09:00

I have done more testing. The result is, despite the lack of documentation, and despite the warnings that something might go wrong,

the code as I wrote it never fails.

I have now added many billions of objects to dicts and sets, on 64 bit and 32 bit platforms, with CPython 2.7 and 3.0 and with PyPy. I have tried it on a bigger machine where I added well over 2 billion objects at once to a single set. It worked perfectly. I have never witnessed a collision with the code as presented in the OP.

This is not a fluke or an accident.

Somebody went to some pains to ensure this behavior. The mystery is, why isn't it documented?

The best I could make out from the other postings, the concern is that the container algorithms somehow lose the uniqueness guaranteed by the id() function in the OP class A, and when that happens, a collision occurs, and __eq__ is called.

That may occur on some platform and some implementation of Python. But everywhere I have tried it, it never happens.
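
One possible explanation, offered as a CPython implementation detail rather than a documented guarantee: when two keys land on the same slot, CPython's dict and set compare the stored full hash values before calling `__eq__`, so keys whose full hashes differ are never compared for equality. The sketch below probes this with a hypothetical class `Counted` that counts its `__eq__` calls; other implementations may behave differently.

```python
class Counted(object):
    """Hypothetical probe class: fixed hash, __eq__ that counts its calls."""
    eq_calls = 0

    def __init__(self, h):
        self.h = h

    def __hash__(self):
        return self.h

    def __eq__(self, other):
        Counted.eq_calls += 1
        return True  # claims equality with everything


# 1, 9 and 17 share their low three bits, so they meet in a small table,
# but their full hash values are all different.
distinct_hashes = set([Counted(1), Counted(9), Counted(17)])
calls_after_distinct = Counted.eq_calls

# Two objects with the same full hash: here __eq__ does get consulted.
same_hash = set([Counted(5), Counted(5)])
calls_after_same = Counted.eq_calls

print("distinct hashes:", len(distinct_hashes), "eq calls:", calls_after_distinct)
print("same hash:", len(same_hash), "eq calls:", calls_after_same)
```

If this reading is right, it would account for the question's class working in practice while the Broken class fails: `id(self) % 256` makes full hash values coincide, whereas `id(self)` does not.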

It may have to do with a couple of undocumented properties: for any class instance obj,

hash( id( obj ) ) == hash( obj )
# and
hash( hash( obj ) ) == hash( obj )

(In fact, hash( id( x ) ) is not always hash( x ). Try x = -2. For instances of a class that does not define __hash__, it seems to be the case in CPython that hash( obj ) == id( obj ) >> 4, that is, id( obj ) // 16. But that strikes me as something that might be implementation-dependent.)

To see when or how the code might break, I tested with the code below. The idea is, if some instance of A somehow collides with a new instance, the new one will fail to be put in the set, because __eq__ fails to break the tie. This code checks whether that ever happens. I have never seen it. Please try it yourself, and let me know what OS and what version of Python you're using!

Be careful --- you can use up all your system resources and crash your computer. Bring up a console and run top in it, to see what's going on. Using the OP definition of class A:

from __future__ import print_function
from sys import stdout

class A( object ):
    def __init__( self, v1, v2 ):
            self.v = ( v1, v2 )

    def __eq__( self, b ):
            return self.v[0] == b.v[0] and self.v[1] == b.v[1]

    def __hash__( self ):
            return id( self )

    def __str__( self ):
            return str( self.v )

NINSTANCES = 3000000    # play with this number -- carefully!
STATUS_INTERVAL = 100000

def test():
        """ hammer the set algorithms """
        s = set()
        instances = []
        for i in range( 0, NINSTANCES ):
                p = A( 1, 0 )
                s.add( p )
                instances.append( p )
                if not i % STATUS_INTERVAL:
                        stdout.write( str( i // STATUS_INTERVAL ) + " " )
                        stdout.flush()
        stdout.write( "\n" )

        print( "length of set", len( s ) )
        print( "number of instances", len( instances ) )

        for i in instances:
                if not i in s:
                        print( "INSTANCE DROPPED OUT!" )
test()

Why are data hashes (as opposed to `__hash__`) used at all for lookups? It is because, for some reason, the original key object may not be available -- especially if the container is to be stored past the run time of the program. Then a new, somehow "equivalent" key can still be used to retrieve items. This is not the use-case for which the code in the OP was intended, however. I wonder -- is there a way that both situations could be supported, *without* abandoning the meaning of the `==` operator as "equivalent"? — Steve White, Oct 23 '19 at 13:08
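
One way both situations might be supported is the wrapper idea mentioned in an earlier comment: the class keeps `==` as data equivalence with a matching value-based hash, and a small wrapper is used whenever a container should be keyed by identity. This is only a sketch under that assumption; the name `IdentityKey` is hypothetical and not taken from any answer in this thread.

```python
class IdentityKey(object):
    """Hypothetical wrapper: compares and hashes by identity of the wrapped object."""
    def __init__(self, obj):
        self.obj = obj  # holding a reference keeps id(obj) valid

    def __eq__(self, other):
        return isinstance(other, IdentityKey) and self.obj is other.obj

    def __ne__(self, other):  # only needed on Python 2
        return not self == other

    def __hash__(self):
        return id(self.obj)


class A(object):
    def __init__(self, v1, v2):
        self.v = (v1, v2)

    def __eq__(self, b):  # == keeps meaning "equivalent data"
        return self.v == b.v

    def __ne__(self, b):  # only needed on Python 2
        return not self == b

    def __hash__(self):   # consistent with __eq__
        return hash(self.v)


p = A(1, 0)
q = A(1, 0)

by_identity = {IdentityKey(p): 3, IdentityKey(q): 4}
by_value = {p: 3}
by_value[q] = 4  # same key as p under value semantics

print("equivalent?", p == q)
print("identity-keyed:", len(by_identity), by_identity[IdentityKey(p)], by_identity[IdentityKey(q)])
print("value-keyed:", len(by_value), by_value[p])
```

The cost is an explicit `IdentityKey(...)` at every insertion and lookup; in exchange, both the class and the wrapper satisfy the documented `__hash__` requirement.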
