Performing set_difference on unordered sets

Question

The std::set_difference algorithm requires the following:

"The elements in the ranges shall already be ordered according to this same criterion"

which is not the case for hash tables.

I'm thinking of implementing a set difference A-B in terms of std::remove_copy_if (the predicate-taking variant of std::remove_copy), where the removal criterion would be that an element of A also exists in the set B.

Is there a standard-valid-fastest-safest way to do it?

Maybe it is faster (I am sure it is safer) to use temporary std::set objects and insert the hash table data into the std::set objects. Then call set_difference() and output the results back into the hash table. I am a proponent of making sure things work first, and then optimize if necessary. — PaulMcKenzie, Mar 28 '14 at 10:46
Well, if you really want to do a temp copy, use std::vector and std::sort, not std::set. It'll be (a lot!) faster and more memory efficient. — ltjax, Mar 28 '14 at 14:22
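The temporary-copy idea from the comments above could be sketched roughly as follows. The helper name sorted_difference and the choice of int elements are assumptions made for illustration, not something from the thread:

```cpp
#include <algorithm>
#include <iterator>
#include <unordered_set>
#include <vector>

// Hypothetical helper: computes a - b by copying both hash sets into
// vectors, sorting them, and calling std::set_difference on the sorted copies.
std::vector<int> sorted_difference(const std::unordered_set<int>& a,
                                   const std::unordered_set<int>& b) {
    std::vector<int> va(a.begin(), a.end());
    std::vector<int> vb(b.begin(), b.end());
    std::sort(va.begin(), va.end());
    std::sort(vb.begin(), vb.end());

    std::vector<int> result;
    std::set_difference(va.begin(), va.end(), vb.begin(), vb.end(),
                        std::back_inserter(result));
    return result;  // elements come out in ascending order
}
```

The result is left in a vector here; copying it back into a hash table, as the first comment suggests, would be one more range insert.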

score 13 · Accepted Answer · edited Nov 07 '16 at 19:26


If you have two hash tables, the most efficient way should be to iterate over one of them, looking up each element in the other hash table. Then insert the ones you do not find into some third container. A rough sketch might look like this:

std::vector<int> result;
std::copy_if(lhs.begin(), lhs.end(), std::back_inserter(result),
    [&rhs] (int needle) { return rhs.find(needle) == rhs.end(); });
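A self-contained variant of the sketch above might look like this, assuming both inputs are std::unordered_set and that the result should also be a hash set. The name hash_difference is hypothetical:

```cpp
#include <algorithm>
#include <iterator>
#include <unordered_set>

// Hypothetical helper (not from the answer): returns the elements of lhs
// that are absent from rhs, collected into a new unordered_set.
template <typename T>
std::unordered_set<T> hash_difference(const std::unordered_set<T>& lhs,
                                      const std::unordered_set<T>& rhs) {
    std::unordered_set<T> result;
    std::copy_if(lhs.begin(), lhs.end(),
                 std::inserter(result, result.end()),
                 [&rhs](const T& needle) {
                     // Keep the element only if rhs does not contain it.
                     return rhs.find(needle) == rhs.end();
                 });
    return result;
}
```

For example, with lhs = {1, 2, 3, 4} and rhs = {2, 4, 8}, the returned set holds 1 and 3.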

edited Nov 07 '16 at 19:26 by Marti Nito · answered Mar 28 '14 at 11:04 by John Zwinck

I prefer rhs.count(needle) == 0. My main criticism of your answer, however, is that you have only given your algorithm as code and have not stated why you think it is the fastest available method. – CashCow Aug 02 '17 at 08:10

@CashCow: Or [in C++20, `!rhs.contains(needle);`](https://en.cppreference.com/w/cpp/container/unordered_set/contains), because TIMTOWTDI. :-) – ShadowRanger Jan 31 '20 at 17:08

score 4 · Answer 2 · answered Aug 02 '17 at 08:10

If you have two unordered sets A and B of sizes Na and Nb and you want the set difference, i.e. all elements of A not in B, then because a lookup in B takes constant time on average, simply iterating over A and checking whether each element is in B has an average complexity of O(Na).

If A is an unordered set and B is a set (or a sorted vector, etc.), then each lookup is O(log(Nb)), so the full complexity is O(Na*log(Nb)).
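As a rough illustration of that second case, here is a sketch that assumes B is available as an already-sorted std::vector<int>. The helper name difference_vs_sorted is made up for this example:

```cpp
#include <algorithm>
#include <unordered_set>
#include <vector>

// Hypothetical helper: a is a hash set, sorted_b must already be sorted
// in ascending order. Each lookup is a binary search, so the total cost
// is O(Na * log(Nb)) comparisons.
std::vector<int> difference_vs_sorted(const std::unordered_set<int>& a,
                                      const std::vector<int>& sorted_b) {
    std::vector<int> result;
    for (int x : a) {
        if (!std::binary_search(sorted_b.begin(), sorted_b.end(), x)) {
            result.push_back(x);
        }
    }
    return result;  // order follows a's iteration order, which is unspecified
}
```

If B were a std::set instead, sorted_b.find(x) would play the role of the binary search with the same logarithmic cost per lookup.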

Sorting A first would cost O(Na*log(Na)) for the sort and then O(Na+Nb) for the merge. If Na is significantly smaller than Nb, then Na*log(Nb) is significantly smaller than Na+Nb anyway, and as Na grows towards Nb, sorting it first isn't going to be any quicker.

Therefore I reckon you gain nothing by sorting A first (by sorting it first, I mean moving it to a sorted collection).
