5

I have a pool of data (X1..XN), for which I want to find groups of equal values. Comparison is very expensive, and I can't keep all data in memory.

The result I need is, for example:

X1 equals X3 and X6
X2 is unique
X4 equals X5

(Order of the lines, or order within a line, doesn't matter).

How can I implement that with pair-wise comparisons?


Here's what I have so far:

Compare all pairs (Xi, Xk) with i < k, and exploit transitivity: if I already found X1==X3 and X1==X6, I don't need to compare X3 and X6.

So I could use the following data structures:

  map: index --> group
  multimap: group --> indices

where group is arbitrarily assigned (e.g. "line number" in the output).

For a pair (Xi, Xk) with i < k :

  • if both i and k already have a group assigned, skip

  • if they compare equal:

    • if i already has a group assigned, put k in that group
    • otherwise, create a new group for i and put k in it
  • if they are not equal:

    • if i has no group assigned yet, assign a new group for i
    • same for k

That should work if I'm careful with the order of items, but I wonder if this is the best / least surprising way to solve this, as this problem seems to be somewhat common.
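
For illustration, here is a minimal sketch of that bookkeeping with STL containers; `is_equal(i, k)` is a placeholder I'm introducing for the expensive comparison, not an existing function:

    #include <cstddef>
    #include <map>
    #include <utility>

    // Placeholder for the expensive pair-wise comparison of items Xi and Xk.
    bool is_equal(std::size_t i, std::size_t k);

    // Groups items 0..n-1; each distinct key in the returned multimap is one output line.
    std::multimap<int, std::size_t> group_items(std::size_t n)
    {
        std::map<std::size_t, int> group_of;      // index -> group
        std::multimap<int, std::size_t> members;  // group -> indices
        int next_group = 0;

        for (std::size_t i = 0; i < n; ++i)
        {
            if (group_of.count(i))   // duplicates of i were already found via the first
                continue;            // member of its group, so nothing more to do here

            const int g = next_group++;
            group_of[i] = g;
            members.insert(std::make_pair(g, i));

            for (std::size_t k = i + 1; k < n; ++k)
            {
                if (group_of.count(k))    // k is already grouped: skip the comparison
                    continue;
                if (is_equal(i, k))       // the expensive comparison
                {
                    group_of[k] = g;
                    members.insert(std::make_pair(g, k));
                }
            }
        }
        return members;
    }

A group that ends up with a single member corresponds to an "is unique" line in the output.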


Background / more info: the purpose is deduplicating storage of the items. They already have a hash; in case of a collision we want to guarantee a full comparison. The size of the data in question has a very sharp long-tail distribution.

An iterative algorithm (find any two duplicates, share them, repeat until there are no duplicates left) might be easier, but we want non-modifying diagnostics. The code base is C++; something that works with STL / Boost containers or algorithms would be nice.

[edit] Regarding the hash: For the purpose of this question, please assume a weak hash function that cannot be replaced.

This is required for a one-time deduplication of existing data, and it needs to deal with hash collisions. The original choice was "fast hash, and compare on collision"; the hash chosen turns out to be a little weak, but changing it would break backward compatibility. Even then, I sleep better with a simple statement: "In case of a collision, you won't get the wrong data", instead of blogging about wolf attacks.

peterchen
  • With a good hash function, the probability of a collision is approximately zero, so you should worry only about the common case of all items being equal. – David Eisenstat Jul 22 '13 at 15:16
  • "one-time...backward compatibility" That makes no sense. – David Eisenstat Jul 22 '13 at 15:52
  • @DavidEisenstat: this is a trust issue more than a probability one. The direct damage of a one in a trillion collision may be negligible, convincing customers of that is an entirely different matter. – peterchen Jul 22 '13 at 15:53
  • @DavidEisenstat: The current format allows for deduplication already but does it only in a few "easy" cases. For most users this will be a one time transparent maintenance operation - unless they switch between old and new versions frequently. – peterchen Jul 22 '13 at 15:57
  • I wasn't suggesting that you could leave out the expensive checks, just that optimizing it after cryptographic hashes compared equal was probably a waste of time in all but the non-"wolf attack" case. – David Eisenstat Jul 22 '13 at 15:59

4 Answers

1

Make a hash of each item and build a list of pair<hash, item_index>. You can find the groups by sorting this list by hash, or by putting it into a std::multimap.

When you output the group list, you need to compare items to rule out hash collisions. So for each item you will do one hash calculation and about one comparison, plus the sorting of the hash list.
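
For illustration, a sketch of that approach; `hash_of` and `is_equal` are placeholders for the existing hash and the expensive comparison:

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    unsigned hash_of(std::size_t index);             // the existing (weak) hash
    bool is_equal(std::size_t i, std::size_t k);     // the expensive comparison

    std::vector<std::vector<std::size_t> > find_groups(std::size_t n)
    {
        // One hash calculation per item.
        std::vector<std::pair<unsigned, std::size_t> > by_hash;
        for (std::size_t i = 0; i < n; ++i)
            by_hash.push_back(std::make_pair(hash_of(i), i));
        std::sort(by_hash.begin(), by_hash.end());   // brings equal hashes together

        std::vector<std::vector<std::size_t> > groups;
        for (std::size_t i = 0; i < by_hash.size(); )
        {
            // Find the run of items sharing this hash value.
            std::size_t j = i + 1;
            while (j < by_hash.size() && by_hash[j].first == by_hash[i].first)
                ++j;

            // Within the run, full comparisons resolve collisions.
            std::vector<bool> taken(j - i, false);
            for (std::size_t a = i; a < j; ++a)
            {
                if (taken[a - i])
                    continue;
                std::vector<std::size_t> group(1, by_hash[a].second);
                for (std::size_t b = a + 1; b < j; ++b)
                    if (!taken[b - i] && is_equal(by_hash[a].second, by_hash[b].second))
                    {
                        group.push_back(by_hash[b].second);
                        taken[b - i] = true;
                    }
                groups.push_back(group);
            }
            i = j;
        }
        return groups;
    }

With a good hash almost every run has length 1, so the inner comparison loops rarely run.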

Leonid Volnitsky
  • But if you have more than 2 items with the same hash, we're back to the original problem, in which case, you will need more than one comparison per item. – darksky Jul 22 '13 at 20:53
  • Hash collisions are rare with a good hash, so it will be *about* one comparison per item. I've added `about` to my answer. – Leonid Volnitsky Jul 22 '13 at 21:00
1

So... you already have a hash? How about this:

  • sort and group on hash
  • print all groups with size 1 as unique
  • compare collisions

Tip for comparing collisions: why not just rehash them with a different algorithm? Rinse, repeat.

(I am assuming you are storing files/blobs/images here, that you have hashes of them, and that you can slurp the hashes into memory; also that the hashes are something like SHA-1/MD5, so collisions are very unlikely.)

(I'm also assuming that two different hashing algorithms will not both collide on the same pair of different inputs, but this is probably safe to assume...)
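
For illustration, a sketch of the sort / group / rehash steps above; `weak_hash_of` and `second_hash_of` are placeholders for the existing hash and a second, different algorithm:

    #include <cstddef>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    unsigned    weak_hash_of(std::size_t index);     // the hash you already have
    std::string second_hash_of(std::size_t index);   // e.g. an MD5/SHA-1 of the item

    // Buckets items by (weak hash, second hash). A bucket of size 1 is unique;
    // a larger bucket is (almost certainly) a group of equal items.
    std::map<std::pair<unsigned, std::string>, std::vector<std::size_t> >
    group_by_two_hashes(std::size_t n)
    {
        // First pass: bucket on the cheap, already-available weak hash.
        std::multimap<unsigned, std::size_t> by_weak;
        for (std::size_t i = 0; i < n; ++i)
            by_weak.insert(std::make_pair(weak_hash_of(i), i));

        std::map<std::pair<unsigned, std::string>, std::vector<std::size_t> > groups;
        for (std::multimap<unsigned, std::size_t>::iterator it = by_weak.begin();
             it != by_weak.end(); ++it)
        {
            // Only buckets with more than one member pay for the second hash.
            std::string second = by_weak.count(it->first) > 1
                                     ? second_hash_of(it->second)
                                     : std::string();
            groups[std::make_pair(it->first, second)].push_back(it->second);
        }
        return groups;
    }

If you must still guarantee correctness when both hashes collide, the buckets that end up with more than one member can go through the full pair-wise comparison from the question.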

Daren Thomas
1

Here's another, maybe simpler, data structure for exploiting transitivity. Make a queue of the comparisons you need to do. For example, in the case of 4 items, it will be [ (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) ]. Also keep an array of the comparisons you've already done. Before each comparison, check whether that comparison has been done before, and every time you find a match, go through the queue and replace the matching item index with its lower-index equivalent.

For example, suppose we pop (1,2), compare, find they're not equal, push (1,2) to the already_visited array, and continue. Next, pop (1,3) and find that they are equal. At this point, go through the queue and replace all 3's with 1's. The queue will be [ (1,4), (2,1), (2,4), (1,4) ], and so on. When we reach (2,1), it has already been visited (as (1,2)), so we skip it, and the same with the duplicate (1,4).
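
For illustration, a sketch of that queue bookkeeping; `is_equal` again stands in for the expensive comparison:

    #include <algorithm>
    #include <cstddef>
    #include <deque>
    #include <map>
    #include <set>
    #include <utility>

    bool is_equal(std::size_t i, std::size_t k);   // the expensive comparison

    // Returns a map: duplicate index -> lowest index of its group.
    // Indices not present in the map are unique (or group representatives).
    std::map<std::size_t, std::size_t> find_duplicates(std::size_t n)
    {
        typedef std::pair<std::size_t, std::size_t> item_pair;

        std::deque<item_pair> todo;                // comparisons still to do
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = i + 1; k < n; ++k)
                todo.push_back(std::make_pair(i, k));

        std::set<item_pair> done;                  // comparisons already done
        std::map<std::size_t, std::size_t> rep;    // duplicate -> representative

        while (!todo.empty())
        {
            item_pair p = todo.front();
            todo.pop_front();
            if (p.first > p.second)                // normalize, e.g. (2,1) -> (1,2)
                std::swap(p.first, p.second);
            if (p.first == p.second || done.count(p))
                continue;                          // trivial or already visited
            done.insert(p);

            if (is_equal(p.first, p.second))
            {
                rep[p.second] = p.first;
                // Rewrite the higher index to its lower equivalent in the rest of the queue.
                for (std::deque<item_pair>::iterator it = todo.begin(); it != todo.end(); ++it)
                {
                    if (it->first == p.second)  it->first  = p.first;
                    if (it->second == p.second) it->second = p.first;
                }
            }
        }
        return rep;
    }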

But I do agree with the previous answers. Since comparisons are computationally expensive, you probably want to compute a fast, reliable hash table first, and only then apply this method to the collisions.

darksky
  • I didn't end up using this - but accepted since you pushed me in the right direction :) `map > >` for each hash with multiple occurrences, sharing the `set` instances between all with the same data. --- Since the "main" application guarantees comparison in case of hash conflict, this maintenance op will do so also; as said, it's not a probability but a trust issue. – peterchen Jul 24 '13 at 09:52
0

I agree with the idea of using a second (hopefully improved) hash function so you can resolve some of your weak hash's collisions without needing to do costly pairwise comparisons. Since you say you have memory limitations, hopefully you can fit the entire hash table (with secondary keys) in memory, where for each entry in the table you store a list of indices of the records on disk that correspond to that key pair. The question then is whether, for each key pair, you can load into memory all the records that have that key pair.

If so, you can just iterate over key pairs: for each key pair, free any records in memory from the previous key pair, load the records for the current key pair, and then do comparisons among these records as you already outlined. If you have a key pair whose records don't all fit into memory, you'll have to load partial subsets, but you should definitely be able to keep in memory all the groups (with a unique record representative for each group) you have found for that key pair, since the number of unique records will be small if you have a good secondary hash.
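
For illustration, a sketch of that outer loop; the record type and the disk-access helpers (`load_record`, `free_record`, `records_equal`) are hypothetical stand-ins, not an existing API:

    #include <cstddef>
    #include <map>
    #include <utility>
    #include <vector>

    struct Record;
    Record* load_record(std::size_t index);                  // read one record from disk
    void    free_record(Record* r);
    bool    records_equal(const Record* a, const Record* b); // the expensive comparison

    typedef std::pair<unsigned, unsigned> key_pair;          // (weak hash, secondary hash)

    // index: key pair -> indices of the on-disk records that have that key pair.
    void process(const std::map<key_pair, std::vector<std::size_t> >& index)
    {
        typedef std::map<key_pair, std::vector<std::size_t> >::const_iterator bucket_iter;
        for (bucket_iter kp = index.begin(); kp != index.end(); ++kp)
        {
            // Load only the records that share this key pair.
            std::vector<Record*> loaded;
            for (std::size_t i = 0; i < kp->second.size(); ++i)
                loaded.push_back(load_record(kp->second[i]));

            // Group within the bucket: compare each record against one
            // representative per group found so far.
            std::vector<std::vector<std::size_t> > groups;   // positions into loaded / kp->second
            for (std::size_t i = 0; i < loaded.size(); ++i)
            {
                bool placed = false;
                for (std::size_t g = 0; g < groups.size() && !placed; ++g)
                    if (records_equal(loaded[groups[g][0]], loaded[i]))
                    {
                        groups[g].push_back(i);
                        placed = true;
                    }
                if (!placed)
                    groups.push_back(std::vector<std::size_t>(1, i));
            }
            // ... report the groups here, translating positions back via kp->second[...] ...

            for (std::size_t i = 0; i < loaded.size(); ++i)
                free_record(loaded[i]);
        }
    }

If a bucket doesn't fit in memory, the same inner loop still works on partial subsets as long as one loaded representative per group is kept around.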

user2566092