Efficient, or fast, size of the set intersection of two vectors

Question

I find myself needing to return the size of the intersection of two vectors:

std::vector<int> A_, B_;

I do not require the intersected values, just the size of the set. This function needs to be called a very large number of times. This is part of a much bigger simulation over a (mathematical) graph/network.

My working conditions are:

Containers are vectors. Changing them would be painful, but I would certainly do so if the gain warrants it.
The sizes of A_ and B_ have an upper bound of ~100, but are often much smaller.
Elements of A_ and B_ represent samples taken from {1,2,...,M}, where M > 10,000.
In general, A_ and B_ have similar, but unequal, sizes.
Both vectors are unordered.
The contents of A_ and B_ change, as part of the "bigger simulation".
Each vector contains only unique elements i.e. no repeats.

My first attempt, using a naive loop, is below, but I suspect it will not be fast enough. I've assumed that std::set_intersection will be too onerous due to repeated sorts and allocations.

int vec_intersect(const std::vector<int>& A_, const std::vector<int>& B_) {
    int c_count = 0;

    for (std::vector<int>::const_iterator it = A_.begin(); it != A_.end(); ++it) {
        for (std::vector<int>::const_iterator itb = B_.begin(); itb != B_.end(); ++itb) {
            if (*it == *itb) ++c_count;
        }
    }

    return c_count;
}

Given my conditions above, how else can I implement this to gain speed, relatively easily? Should I be thinking about hash tables or going with sorts and STL, or different containers?

Since you are willing to change the data structure, if you use a `std::set` instead of `std::vector`, you get sorting for free. Then use `set_intersection`. — Praetorian, Jun 21 '14 at 01:57
You've assumed that the easiest method (`set_intersection`) will be what? Too slow? Try it. If it's a bottleneck, then you can move on to something else. Don't assume it will be a bottleneck until you've isolated it as a problem by profiling it. — Chad, Jun 21 '14 at 01:59
Changing to `set` means you pay the "sorting" price for all use cases, even if it's unnecessary for any specific use case. — Chad, Jun 21 '14 at 02:00
@Praetorian: `std::set` has bad constant factors for a lot of scenarios due to being a tree based structure (e.g. poor cache locality). If the usage model is such that the number of edits relative to the size of the data is large, then `set` will probably win. If the number of edits is small, the amortized cost of sorting the data will probably win. — Billy ONeal, Jun 21 '14 at 02:02
@RusanKax You probably get best performance by sorting the arrays and doing O(N) pass over them to find the number of matches. Then you can focus on finding the sorting algorithm which performs best for your specific arrays. — JarkkoL, Jun 21 '14 at 02:18
@JarkkoL: Even in the best case that would be O(m lg n) -- treating a sorted array as a set has lg n lookup time. — Billy ONeal, Jun 21 '14 at 02:20
@BillyONeal Do you know of an algorithm that performs better than O(n*lgn)? — JarkkoL, Jun 21 '14 at 02:32
@Jarkk: Hash table is probabilistically O(m + n) (see example in my answer) — Billy ONeal, Jun 21 '14 at 02:35
@BillyONeal Yeah, but due to cache coherency I doubt it would perform better. Good point in your post about needing to sort only one array though. — JarkkoL, Jun 21 '14 at 02:43
@Praetorian: *use a `std::set` […] you get sorting for free*. Quite the contrary: insertions are `O(log N)`, thus `N` insertions are `O(N log N)` operations, with the additional penalty for non-contiguous memory. Insertion into a sorted vector of ~100 `int` is probably faster than insertion into a set of similar size (`O(log N)` to find the insertion point + expensive `O(1)` [`memmove`] to make space for the element, compared with `O(log N)` to find the location in the set + expensive `O(1)` [`new`]). If your library does not do `memmove`, choose a different standard library. – David Rodríguez - dribeas, Jun 21 '14 at 03:19
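
Two of the comments above suggest keeping the data sorted: a single pass over sorted arrays (JarkkoL) and insertion into a sorted vector (David Rodríguez). The following is a minimal sketch of that idea, assuming both vectors are kept sorted as the simulation edits them and contain no repeats. The names `sorted_insert` and `sorted_intersection_size` are hypothetical and do not come from the thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: insert `value` into an already sorted vector of unique
// elements, keeping it sorted. Returns false if the value was already present.
bool sorted_insert(std::vector<int>& v, int value)
{
    auto pos = std::lower_bound(v.begin(), v.end(), value);  // O(log N) search
    if (pos != v.end() && *pos == value) {
        return false;                                         // no duplicates
    }
    v.insert(pos, value);                                     // shifts the tail
    return true;
}

// Hypothetical helper: count the common elements of two sorted vectors with
// unique elements in a single O(m + n) pass, without storing the intersection.
std::size_t sorted_intersection_size(const std::vector<int>& a,
                                     const std::vector<int>& b)
{
    std::size_t count = 0;
    auto ia = a.begin();
    auto ib = b.begin();
    while (ia != a.end() && ib != b.end()) {
        if (*ia < *ib) {
            ++ia;
        } else if (*ib < *ia) {
            ++ib;
        } else {          // equal: one common element found
            ++count;
            ++ia;
            ++ib;
        }
    }
    return count;
}
```

Whether this beats a hash set depends on how often the vectors are edited relative to how often they are intersected, so it is worth measuring on the real workload.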

Sergey Kalinichenko · Accepted Answer · 2014-06-21T02:39:50.657

17

Your algorithm is O(n²) in the number of elements (assuming that the size of both vectors is approximately equal to n). Here is an O(n) algorithm:

Create an std::unordered_set<int>
Put all items of vector A into the set
Go through all items of vector B, checking whether each one is present in the unordered_set, and incrementing the count for each item that is present.
Return the final count.

Here is an implementation in C++11, using a lambda for brevity:

// Requires <algorithm>, <iostream>, <unordered_set>, <vector> and using namespace std.
vector<int> a {2, 3, 5, 7, 11, 13};
vector<int> b {1, 3, 5, 7, 9, 11};
unordered_set<int> s(a.begin(), a.end());
// The lambda captures the set by reference. count_if passes each element of b
// to the lambda, which returns true if the element is in the set.
auto res = count_if(b.begin(), b.end(), [&](int k) {return s.find(k) != s.end();});
cout << res << endl;

(this prints 4)

edited Jun 21 '14 at 02:39

answered Jun 21 '14 at 01:58

3

*probabilistic O(n)* `set_intersection` is worst-case O(n) but requires the input be sorted. – Billy ONeal Jun 21 '14 at 02:01
Thanks to dasblinkenlight and Billy ONeal. In my use case above, many of the trials will return empty intersections (because M>10,000 and A_.size() and B_.size() are much smaller, as in my original post). So maybe I'll play with this to optimise that out. But this is already double Jesus quicker. Cheers! – Rusan Kax Jun 21 '14 at 17:03
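
The hash set in the answer is rebuilt on every call. Because the question states that elements are drawn from {1,2,...,M}, one alternative not proposed in the thread is a lookup table that is allocated once and reused. The sketch below assumes that M is known up front, that every element lies in [0, M], and that neither vector contains repeats. `IntersectionCounter` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper class. The table is allocated once and reused across
// calls, so each call does no allocation and no hashing.
class IntersectionCounter {
public:
    // max_value plays the role of M from the question (an assumption here).
    explicit IntersectionCounter(std::size_t max_value)
        : seen_(max_value + 1, 0) {}

    // Assumes every element is in [0, max_value] and each vector has no repeats.
    int count(const std::vector<int>& a, const std::vector<int>& b)
    {
        if (a.empty() || b.empty()) {
            return 0;                                  // trivial empty case
        }
        for (int x : a) {
            seen_[static_cast<std::size_t>(x)] = 1;    // mark elements of a
        }
        int c = 0;
        for (int x : b) {
            c += seen_[static_cast<std::size_t>(x)];   // 1 if also in a
        }
        for (int x : a) {
            seen_[static_cast<std::size_t>(x)] = 0;    // reset for the next call
        }
        return c;
    }

private:
    std::vector<unsigned char> seen_;
};
```

Each call touches only the marked entries, so the cost is O(m + n) regardless of M. The table costs M + 1 bytes, and one instance must not be shared between threads without synchronization.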

Billy ONeal · Answer 2 · 2014-06-21T02:33:47.310

Your algorithm is O(n*m), where n and m are the number of elements in the vectors.

If the input data is trusted (see the note on collisions below), you'll probably get the best results with:

Place all the elements of A into an unordered_set
For each element in B, if it is in the set, increment your counter.

For example:

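#include <algorithm>
#include <unordered_set>
#include <vector>

int vec_intersect(const std::vector<int>& A_, const std::vector<int>& B_)
{
    // Build a hash set from A_, then count the elements of B_ found in it.
    std::unordered_set<int> aSet(A_.cbegin(), A_.cend());
    return static_cast<int>(std::count_if(B_.cbegin(), B_.cend(), [&](int element) {
        return aSet.find(element) != aSet.end();
    }));
}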

This gives O(m + n) expected time. (Hash table lookups are almost always O(1), but if an attacker can force many collisions in the table, each lookup can degrade to O(n), leading to denial of service.)

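If you require deterministic performance, and the order of the elements in the vectors does not matter, sorting one vector will work. With a first vector of size m and a second of size n, this costs O(m lg m) for the sort plus O(n lg m) for the lookups. That is: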

Sort the first vector
For each element in the second vector, use binary search to determine if the element is in the first vector, and if so, increment your counter.

For example:

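#include <algorithm>
#include <vector>

// Note: A_ is taken by non-const reference and is sorted in place.
int vec_intersect(std::vector<int>& A_, const std::vector<int>& B_)
{
    std::sort(A_.begin(), A_.end());
    return static_cast<int>(std::count_if(B_.cbegin(), B_.cend(), [&](int element) {
        return std::binary_search(A_.cbegin(), A_.cend(), element);
    }));
}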
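
The function above reorders the caller's first vector. If the original order matters elsewhere in the simulation (the question does not say, so this is an assumption), a variant can sort a copy instead. This is only a sketch, and `vec_intersect_sorted_copy` is a hypothetical name.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical variant: A_ is taken by value, so only the local copy is sorted
// and the caller's vector keeps its original order.
int vec_intersect_sorted_copy(std::vector<int> A_, const std::vector<int>& B_)
{
    std::sort(A_.begin(), A_.end());
    return static_cast<int>(std::count_if(B_.cbegin(), B_.cend(), [&](int element) {
        return std::binary_search(A_.cbegin(), A_.cend(), element);
    }));
}
```

The trade-off is one extra allocation and copy per call, which may matter when the function is called a very large number of times.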

Just for giggles, here's an <algorithm>-ized version of your algorithm:

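#include <algorithm>
#include <vector>

int vec_intersect(const std::vector<int>& A_, const std::vector<int>& B_)
{
    // Same O(n*m) behavior as the nested loops: a linear search of A_ per element of B_.
    return static_cast<int>(std::count_if(B_.cbegin(), B_.cend(), [&](int element) {
        return std::find(A_.cbegin(), A_.cend(), element) != A_.cend();
    }));
}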
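
Since the comments on the question recommend measuring before choosing, here is a rough timing sketch for comparing the three approaches on data shaped like the question's (up to ~100 unique elements drawn from {1,...,M}). All names (`count_naive`, `count_hashed`, `count_sorted`, `random_sample`, `time_calls`) are hypothetical, and the harness is deliberately simple rather than a rigorous benchmark.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <unordered_set>
#include <vector>

// The three approaches from this answer, renamed so they can coexist.
int count_naive(const std::vector<int>& a, const std::vector<int>& b)
{
    return static_cast<int>(std::count_if(b.cbegin(), b.cend(), [&](int e) {
        return std::find(a.cbegin(), a.cend(), e) != a.cend();
    }));
}

int count_hashed(const std::vector<int>& a, const std::vector<int>& b)
{
    std::unordered_set<int> set(a.cbegin(), a.cend());
    return static_cast<int>(std::count_if(b.cbegin(), b.cend(), [&](int e) {
        return set.find(e) != set.end();
    }));
}

// Takes `a` by value so repeated calls always start from unsorted input.
int count_sorted(std::vector<int> a, const std::vector<int>& b)
{
    std::sort(a.begin(), a.end());
    return static_cast<int>(std::count_if(b.cbegin(), b.cend(), [&](int e) {
        return std::binary_search(a.cbegin(), a.cend(), e);
    }));
}

// Draw n distinct values from {1, ..., max_value}; requires n <= max_value.
std::vector<int> random_sample(std::size_t n, int max_value, std::mt19937& rng)
{
    std::uniform_int_distribution<int> dist(1, max_value);
    std::unordered_set<int> chosen;
    while (chosen.size() < n) {
        chosen.insert(dist(rng));
    }
    return std::vector<int>(chosen.begin(), chosen.end());
}

// Time `calls` invocations of f(a, b) and return the elapsed microseconds.
// The results are summed into `checksum` so the calls are not optimized away.
template <typename F>
long long time_calls(F f, const std::vector<int>& a, const std::vector<int>& b,
                     int calls, long long& checksum)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < calls; ++i) {
        checksum += f(a, b);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling `time_calls` once per function on the same pair of samples, for example 100 and 80 elements with M = 10,000, gives a rough comparison. The numbers depend heavily on optimization flags and hardware, so treat them as indicative only.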
