Is there another way to avoid duplication of large hashable objects?

Question

I am processing text and need to store large sequences of hashable objects: sometimes strings, sometimes tuples of words, etc. I've been thinking of using the hash function to provide a simple store-and-retrieve class, but with my first approach a single hash key might resolve to more than one item. If I add a get function that takes the return value of add as its argument, I cannot know which item in the list to return.

class HashStore:
    def __init__(self):
        self.uniques = {}

    def add(self, big_hashable):
        hash_value = hash(big_hashable)
        if hash_value not in self.uniques:
            self.uniques[hash_value] = [big_hashable]
        elif big_hashable not in self.uniques[hash_value]:
            self.uniques[hash_value].append(big_hashable)

        return hash_value

Another approach ensures that there is exactly one mapping for each unique hashable item.

class SingleStore:
    def __init__(self):
        self.uniques = {}
        self.indexed = {}
        self.index = 0

    def add(self, big_hashable):
        if big_hashable not in self.uniques:
            self.index += 1
            self.uniques[big_hashable] = self.index
            self.indexed[self.index] = big_hashable

        return self.uniques[big_hashable]

This works and ensures that the return value of add can be used to retrieve a unique value. It just seems a bit clumsy. Is there a better, more Pythonic way of handling this situation?

I've been ambiguous as to the question. There are two issues. First, I have millions of objects that currently use keys ranging from hundreds to thousands of bytes each (the big_hashable thing); converting those keys to integers would let me process more data than I currently can. Second, keeping only a single canonical copy of each big_hashable thing would also cut memory usage, though it is the first issue that is driving my question, because each key is actually a separate copy of the big_hashable thing.

@PeterDeGlopper: You can test whether an object is in a set, but you can't efficiently retrieve the set's copy of the object. — user2357112, Aug 23 '13 at 19:29
That said, I'm not sure whether the question requires that functionality. At first, I interpreted it as being like string interning, in which case you would want to be able to efficiently retrieve the canonical copy of an object, but now I'm not sure. — user2357112, Aug 23 '13 at 19:31
@user2357112 True, though I don't see the requirement for this in the question (and while I can think of cases where that might matter, I can't think of any *good* use case). In any case, if you need that, you can just use a dict and store the canonical object as value. — , Aug 23 '13 at 19:31
@delnan: It looks to me like the OP wants to store a sequence of large, hashable objects, some of which may be equal, and wants to use only one copy of equal objects. That would require the ability to retrieve a canonical copy. That said, it's probably premature optimization. — user2357112, Aug 23 '13 at 19:44
There are a number of collision strategies for hash tables, with plenty of sample code out there. In fact, somewhere on ActiveState, there's a recipe for a pure-Python hash table with exactly the same collision strategy as the CPython 2.something `dict` and `set` implementations. But doing that will obviously be exactly the same as just using `dict` or `set`, except a whole lot slower and more complicated, which is a pretty good argument for just using `dict` or `set` as appropriate… — abarnert, Aug 23 '13 at 20:17
"each key is actually a separate copy of the big_hashable thing" - not if you canonicalize them. — user2357112, Aug 23 '13 at 21:32
OK, thank you. I get your point - I didn't follow it until now. — bmacnaughton, Aug 23 '13 at 21:53

user2357112 · Accepted Answer · 2013-08-23T19:38:41.370

If you don't need to be able to efficiently retrieve a canonical copy of an object given a different copy, you can just use a set:

s = set()
s.add(3)
s.add(3)
# s only has one 3 in it

If you do need to be able to efficiently retrieve canonical copies of objects, don't store them by the hash value. Unequal objects can share a hash value, so that'd be horribly broken. Just use the hashable directly.

class Interner(object):
    def __init__(self):
        self._store = {}
    def canonical_object(self, thing):
        """Returns a canonical object equal to thing.

        Always returns the same result for equal things.

        """

        return self._store.setdefault(thing, thing)
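
As a rough illustration of both points, the sketch below repeats the class in compact form so that it runs on its own. It first shows two unequal integers that share a hash value (a CPython implementation detail, used here only as an example), then shows two equal tuples resolving to one canonical object. The variable names are made up for the example.

```python
class Interner:
    """Same idea as the class above: map each value to its first-seen copy."""

    def __init__(self):
        self._store = {}

    def canonical_object(self, thing):
        return self._store.setdefault(thing, thing)


# Unequal objects can share a hash value, so a hash alone cannot identify one.
# In CPython, hash(-1) and hash(-2) are both -2.
collide = hash(-1) == hash(-2)

interner = Interner()

# Build two equal tuples at runtime so they are distinct objects in memory.
first = tuple("the quick brown fox".split())
second = tuple("the quick brown fox".split())

canonical_first = interner.canonical_object(first)
canonical_second = interner.canonical_object(second)

print(first == second, first is second)     # equal, but two separate objects
print(canonical_first is canonical_second)  # both resolve to the same object
```

Once every stored reference goes through canonical_object, the duplicate tuple (second here) is no longer referenced by anything and can be reclaimed, which is where the memory saving comes from.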

With the weakref module, you can improve this to not keep a canonical object if the client code lets go of it, just like the built-in intern function (sys.intern in Python 3) does for strings.
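
One possible shape for that weakref variant is sketched below. It is not part of the original answer: WeakInterner and Phrase are hypothetical names, and the design (a WeakKeyDictionary whose values are weak references to their own keys) is just one way to avoid holding a strong reference. Note that plain str and tuple instances cannot be weakly referenced in CPython, so the sketch assumes the large values are wrapped in an ordinary class.

```python
import weakref


class Phrase:
    """Hypothetical weak-referenceable wrapper around a tuple of words."""

    def __init__(self, words):
        self.words = tuple(words)

    def __eq__(self, other):
        return isinstance(other, Phrase) and self.words == other.words

    def __hash__(self):
        return hash(self.words)


class WeakInterner:
    """Interner that does not keep canonical objects alive by itself."""

    def __init__(self):
        # Keys are held weakly, and each value is a weak reference to its own
        # key, so the table never holds a strong reference to a stored object.
        self._store = weakref.WeakKeyDictionary()

    def canonical_object(self, thing):
        ref = self._store.get(thing)
        canonical = ref() if ref is not None else None
        if canonical is None:
            # Nothing equal is alive yet: thing becomes the canonical copy.
            self._store[thing] = weakref.ref(thing)
            canonical = thing
        return canonical

    def __len__(self):
        return len(self._store)


interner = WeakInterner()
first = interner.canonical_object(Phrase(["the", "quick", "fox"]))
second = interner.canonical_object(Phrase(["the", "quick", "fox"]))
print(first is second)  # both names refer to the one canonical Phrase
```

When the client code drops its last reference to a canonical object, the entry disappears from the table. In CPython this happens as soon as the reference count reaches zero; other implementations may wait for a garbage collection pass.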

Let me clarify the question and the reason for it. I apologize for not being more clear to begin with. — bmacnaughton, Aug 23 '13 at 20:31
