
I am counting various patterns in graphs, and I store the relevant information in a defaultdict of lists of numpy arrays of size N, where the index values are integers.

I want to efficiently know if I am creating a duplicate array. Not removing duplicates can exponentially grow their number to the point where what I am doing becomes infeasible. But there are potentially hundreds of thousands of arrays, stored in different lists, under different keys. As far as I know, I can't hash an array.

If I simply needed to check for duplicate nonzero indices, I would store the nonzero indices as a bit sequence of ones and then hash that value. But I don't only need to check the indices - I need to also check their integer values. Is there any way to do this short of coming up with a completely new design that uses different structures?

Thanks.

Mazdak
Travis Black
  • If your arrays are the same size, why don't you use a matrix and remove the duplicates by using `np.unique(array, axis=0)`? – Mazdak Oct 05 '18 at 19:41
  • I don't know how many arrays I will have at any given iteration, and I think appending numpy arrays is expensive (someone can correct me if I am wrong, please). I could do this and it would *partially* solve the problem, but I am not sure how much benefit it would be. – Travis Black Oct 05 '18 at 19:43
  • Well, if you're generating arrays through iteration and each iteration takes considerable time, then your best bet would be tuples, which are immutable and hashable, plus a set to keep the unique arrays. Otherwise, just use a numpy array (you can create that array with the `np.fromiter` function) and remove the duplicates using `np.unique(arr, axis=0)`. Also, don't forget that you can always run benchmarks to see whether the cost of converting your iterable into a numpy array and performing `np.unique` exceeds the cost of removing the duplicates. – Mazdak Oct 05 '18 at 19:47
  • "I can't hash an array": `ndarray`s do not work with the built-in `hash` function (to prevent them from being used as `dict` keys, which is another story); however, the underlying data absolutely can be hashed, for example with `hash(x.tobytes())`. Would that help to solve your problem? – myrtlecat Oct 05 '18 at 21:21
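To make the two suggestions from the comments concrete, here is a minimal sketch (the array contents are invented for illustration):

```python
import numpy as np

a = np.array([0, 3, 0, 7])
b = np.array([0, 3, 0, 7])  # same contents, distinct object

# Option 1 (myrtlecat): hash the raw bytes of each array's data.
seen = {a.tobytes()}
print(b.tobytes() in seen)  # True: detected as a duplicate

# Option 2 (Mazdak): stack same-size arrays into a matrix and
# deduplicate the rows in a single call.
stacked = np.stack([a, b])
unique_rows = np.unique(stacked, axis=0)
print(len(unique_rows))  # 1
```

Note that `tobytes()` looks only at the raw buffer, so two arrays with the same bytes but different shapes or dtypes would collide; that is fine here, where all arrays share size N and an integer dtype.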

1 Answer


The basic idea is “How can I use my own hash (and perhaps ==) to store things differently in a set/dict?” (where “differently” includes “without raising a TypeError for being non-hashable”).

The first part of the answer is defining your hash function, for example following myrtlecat’s comment. However, beware the standard non-answer built on it: storing the custom hash of each object in a set (or mapping it to, say, the original object with a dict). The fact that you don’t have to provide an equality implementation is a hint that this is wrong: hash values aren’t always unique! (Exception: if you want to “hash by identity”, and know all your keys will outlive the map, `id` does provide unique “hashes”.)
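To illustrate the collision pitfall concretely, this sketch relies on a CPython implementation detail: `hash(-1)` and `hash(-2)` both evaluate to `-2`:

```python
# Storing only hashes in a set conflates distinct values on collision.
seen_hashes = set()

for value in (-1, -2):  # hash(-1) == hash(-2) == -2 in CPython
    h = hash(value)
    is_duplicate = h in seen_hashes  # -2 is wrongly flagged as a duplicate of -1
    seen_hashes.add(h)
    print(value, is_duplicate)
```

A real set/dict avoids this because it falls back to `__eq__` whenever two keys share a hash bucket; a hash-only set has no such fallback.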

The rest of the answer is to wrap your desired keys in objects that expose your hash/equality functions as __hash__ and __eq__. Note that overriding the non-hashability of mutable types comes with an obligation to not alter the (underlying) keys! (C programmers would often call doing so undefined behavior.)

For code, see an old answer by xperroni (which includes the option to increase safety by basing the comparisons on private copies that are less likely to be altered by some other code), though I’d add __slots__ to combat the memory overhead.
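In that spirit, here is a hedged sketch of such a wrapper (the class name `HashableArray` and the `copy` flag are my own inventions, not taken from the linked answer):

```python
import numpy as np

class HashableArray:
    """Wrap an ndarray so its *contents* drive __hash__/__eq__.

    The wrapped array must not be mutated afterwards; pass copy=True
    to hash a private copy instead (safer, at a memory cost).
    """
    __slots__ = ("_arr", "_hash")  # no per-instance __dict__ overhead

    def __init__(self, arr, copy=False):
        self._arr = np.array(arr) if copy else arr
        self._hash = hash(self._arr.tobytes())  # cache: arrays are frozen

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        return (isinstance(other, HashableArray)
                and np.array_equal(self._arr, other._arr))

# Usage: a set of wrappers detects duplicates by content, not identity.
seen = set()
a = np.array([0, 3, 0, 7])
b = np.array([0, 3, 0, 7])
seen.add(HashableArray(a))
print(HashableArray(b) in seen)  # True: same contents, distinct object
```

Because `__eq__` compares the actual array contents, hash collisions merely cost an extra comparison rather than producing false duplicates.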

Davis Herring