I have a pool of data (X1..XN), for which I want to find groups of equal values. Comparison is very expensive, and I can't keep all data in memory.
The result I need is, for example:
X1 equals X3 and X6
X2 is unique
X4 equals X5
(Order of the lines, or order within a line, doesn't matter).
How can I implement that with pair-wise comparisons?
Here's what I have so far:
Compare all pairs (Xi, Xk) with i < k, and exploit transitivity: if I already found X1==X3 and X1==X6, I don't need to compare X3 and X6.
So I could use the following data structure:
map: index --> group
multimap: group --> indices
where group is arbitrarily assigned (e.g. "line number" in the output).
For a pair (Xi, Xk) with i < k :
if both i and k already have a group assigned, skip
if they compare equal:
- if i already has a group assigned, put k in that group
- otherwise, create a new group for i and put k in it
if they are not equal:
- if i has no group assigned yet, assign a new group for i
- same for k
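A minimal C++ sketch of the bookkeeping above, under some assumptions: items are referred to by 0-based index, pairs are visited in row order (all k for a given i before moving to i+1), and `group_equal` and the `equal(i, k)` callback are hypothetical names standing in for the expensive comparison. It deviates from the rules above in one place: on a mismatch it does not assign a group to k, because k might still equal a later item.

```cpp
#include <cstddef>
#include <functional>
#include <map>

// Hypothetical helper, not an existing STL/boost facility.
// Returns the multimap group --> indices for items 0..n-1.
// equal(i, k) is assumed to be the expensive, transitive comparison.
std::multimap<std::size_t, std::size_t>
group_equal(std::size_t n,
            const std::function<bool(std::size_t, std::size_t)>& equal)
{
    std::map<std::size_t, std::size_t> group_of;      // index --> group
    std::multimap<std::size_t, std::size_t> members;  // group --> indices
    std::size_t next_group = 0;

    for (std::size_t i = 0; i < n; ++i) {
        // i already matched an earlier item, so by transitivity
        // everything about i is known: no comparisons needed.
        if (group_of.count(i) != 0)
            continue;

        // i differs from every earlier group, so it opens a new one.
        const std::size_t g = next_group++;
        group_of[i] = g;
        members.insert({g, i});

        for (std::size_t k = i + 1; k < n; ++k) {
            // k belongs to an earlier group that i failed to match: skip.
            if (group_of.count(k) != 0)
                continue;
            if (equal(i, k)) {
                group_of[k] = g;
                members.insert({g, k});
            }
            // On a mismatch k stays unassigned; it gets a group later,
            // either by matching another item or when the outer loop reaches it.
        }
    }
    return members;
}
```

With this ordering each item is compared only against the first member of each group found so far, so the number of comparisons is bounded by N times the number of distinct values rather than N(N-1)/2. Only the indices and group ids are held in memory; the data itself is touched inside `equal`.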
That should work if I'm careful with the order of items, but I wonder if this is the best / least surprising way to solve this, as this problem seems to be somewhat common.
Background/more info: the purpose is to deduplicate storage of the items. They already have a hash; in case of a hash collision we want to guarantee a full comparison. The sizes of the items follow a very sharp long-tail distribution.
An iterative algorithm (find any two duplicates, share them, repeat until there are no duplicates left) might be easier, but we want non-modifying diagnostics. The code base is C++, so something that works with STL / boost containers or algorithms would be nice.
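Since the items already carry a hash, the pair-wise comparison only has to run within sets of items that share one. A rough sketch of that pre-bucketing step follows; `ItemInfo`, its field types, and `candidate_buckets` are hypothetical names, and using the size as a second key assumes that items of different size can never compare equal.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical per-item metadata; small enough to keep in memory
// even when the items themselves are not.
struct ItemInfo {
    std::uint32_t hash;  // the existing (weak) hash
    std::uint64_t size;  // assumed: equal items always have equal size
};

// Returns the index sets that still need full comparisons:
// items sharing both hash and size. Singletons are dropped,
// since they are already known to be unique.
std::vector<std::vector<std::size_t>>
candidate_buckets(const std::vector<ItemInfo>& items)
{
    std::map<std::pair<std::uint32_t, std::uint64_t>,
             std::vector<std::size_t>> by_key;
    for (std::size_t i = 0; i < items.size(); ++i)
        by_key[std::make_pair(items[i].hash, items[i].size)].push_back(i);

    std::vector<std::vector<std::size_t>> result;
    for (auto& entry : by_key)
        if (entry.second.size() > 1)
            result.push_back(std::move(entry.second));
    return result;
}
```

Each returned bucket can then be grouped independently with pair-wise comparisons, which keeps the expensive work limited to actual hash collisions.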
[edit] Regarding the hash: For the purpose of this question, please assume a weak hash function that cannot be replaced.
This is required for a one-time deduplication of existing data, and it needs to deal with hash collisions. The original choice was "fast hash, and compare on collision"; the hash chosen turns out to be a little bit weak, but changing it would break backward compatibility. Even then, I sleep better with a simple statement ("In case of a collision, you won't get the wrong data") than with blogging about wolf attacks.