
I’m searching for a fast way to build a union of multiple vectors in C++.

More specifically: I have a collection of vectors (usually 15-20 vectors, each with several thousand unsigned integers; always sorted and unique, so they could also be an std::set). At each stage, I choose some of them (usually 5-10) and build a union vector. Then I save the length of the union vector and choose some other vectors. This is repeated several thousand times. In the end I'm only interested in the length of the shortest union vector.

Small example: 

V1: {0, 4, 19, 40}
V2: {2, 4, 8, 9, 19}
V3: {0, 1, 2, 4, 40}
V4: {9, 10} 

// The Input Vectors V1, V2 … are always sorted and unique (could also be an std::set) 

Choose V1, V3; 
Union Vector = {0, 1, 2, 4, 19, 40} -> Size = 6; 

Choose V1, V4; 
Union Vector = {0, 4, 9, 10, 19, 40} -> Size = 6; 

… and so on … 

At the moment I use std::set_union but I’m sure there must be a faster way.

vector<vector<uint64_t>> collection; 
vector<uint64_t> chosen;                       // indices into collection
vector<uint64_t> unionVector, unionVectorTmp;  // running union and scratch buffer

for (size_t i = 0; i < chosen.size(); i++) {
    // Merge the next chosen vector into the running union.
    set_union(collection.at(chosen.at(i)).begin(),
              collection.at(chosen.at(i)).end(),
              unionVector.begin(),
              unionVector.end(),
              back_inserter(unionVectorTmp));
    unionVector.swap(unionVectorTmp);
    unionVectorTmp.clear();
}

I'm grateful for every reference.

EDIT 27.04.2017 A new idea:

    unordered_set<unsigned int> unionSet;
    unsigned int counter = 0;

    for(const auto &sel : selection){
        for(const auto &val : sel){
            auto r = unionSet.insert(val);
            if(r.second){
                counter++;
            }
        }
    }
Flauer
  • If you're only interested in the length, why don't you add the lengths together and compare that instead of creating union vectors that will be discarded immediately? – Rakete1111 Apr 26 '17 at 15:06
  • if you are only interested in the length of the shortest union, you should maybe rethink the whole algorithm, as there might be something more efficient than explicitly constructing the unions – 463035818_is_not_an_ai Apr 26 '17 at 15:06
  • @Rakete1111 it is not that easy (just almost), e.g. union of 1 2 3 and 2 3 4 is 1 2 3 4 and has size 4 only – 463035818_is_not_an_ai Apr 26 '17 at 15:07
  • @tobi303 True, missed that point :) – Rakete1111 Apr 26 '17 at 15:08
  • Have you profiled to see where the time is spent? Some `reserve` might help... From your description, I would write similar code. – Jarod42 Apr 26 '17 at 15:12
  • You can't. You're going to need to look at every element and make sure it's unique. It's possible to do this in O(V*(N+M)), where V is the number of vectors and N and M are the lengths of the pair of vectors being unioned. That's really the best case. You can also do it just by counting and comparing; the union never needs to be created. – Donnie Apr 26 '17 at 15:16
  • @Flauer [`std::set_intersection`](http://en.cppreference.com/w/cpp/algorithm/set_intersection) is fairly fast. For 2 vectors, add the lengths, and remove the length of the set returned by the intersection function. – Rakete1111 Apr 26 '17 at 15:17
  • @Rakete1111 I already thought of that... I will try it! – Flauer Apr 26 '17 at 15:26
  • You will probably need to roll your own loop to process the unions. Since you are only interested in the shortest union, you should remember that one and stop processing the union if it exceeds that length. – Mike Apr 26 '17 at 15:28
  • @Rakete1111 okay, now I remember why I only "thought of that" ... this solution is very simple for only 2 vectors; for more than two it's more complex. Or am I wrong? – Flauer Apr 26 '17 at 15:40
  • @Mike stopping the processing once the union exceeds the shortest previous length could save some time, thanks. – Flauer Apr 26 '17 at 15:41
  • @Flauer Yes. Just add every length together, then remove the length of every permutation of the vectors. – Rakete1111 Apr 26 '17 at 15:44
  • _"...with several thousand unsigned integers; always sorted and unique..."_ With 8 bit integers? How did you do that? – D Drmmr Apr 26 '17 at 15:59
  • @DDrmmr I choose uint8_t only for the small example ... normally these are uint64_t – Flauer Apr 26 '17 at 16:19
  • The solution to your problem may be very different if the data would be `uint8_t`. I suggest changing your question to better match your actual problem. – D Drmmr Apr 26 '17 at 16:34

3 Answers


If they're sorted, you can roll your own merge loop that's O(N+M) in runtime. Otherwise you can use a hash table with similar runtime.
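
A minimal sketch of such a hand-rolled loop for two vectors, assuming both inputs are sorted and free of duplicates as the question states. The name `unionSize` is hypothetical; the function only counts and never materializes the union.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: size of the union of two sorted, duplicate-free vectors.
// Runs in O(N+M) and allocates nothing.
std::size_t unionSize(const std::vector<std::uint64_t>& a,
                      const std::vector<std::uint64_t>& b)
{
    std::size_t i = 0, j = 0, count = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;            // value only in a (so far)
        } else if (b[j] < a[i]) {
            ++j;            // value only in b (so far)
        } else {
            ++i;            // value in both: count it once
            ++j;
        }
        ++count;
    }
    // Whatever is left in either vector is unique to that vector.
    return count + (a.size() - i) + (b.size() - j);
}
```

For V1 and V3 from the question this counts the six distinct values without building the union vector.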

Eric Yang

The de facto way in C++98 is set_intersection, but with C++11 (or TR1) you can use unordered_set. Provided the initial vector is sorted, you will have a nice O(N) algorithm.

  1. Construct an unordered_set out of your first vector
  2. Check if the elements of your 2nd vector are in the set

Something like that will do:

std::unordered_set<int> us(std::begin(v1), std::end(v1));
auto res = std::count_if(std::begin(v2), std::end(v2), [&](int n) { return us.find(n) != std::end(us); });
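
Note that the snippet above counts the elements the two vectors have in common. To get the union size from it, one option is the identity |A ∪ B| = |A| + |B| - |A ∩ B|, as suggested in the comments on the question. The sketch below assumes both vectors contain no duplicates; the function name `unionSizeViaIntersection` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical helper: union size of two duplicate-free vectors,
// computed as |v1| + |v2| - |v1 ∩ v2|.
std::size_t unionSizeViaIntersection(const std::vector<int>& v1,
                                     const std::vector<int>& v2)
{
    // Hash all elements of the first vector.
    const std::unordered_set<int> us(v1.begin(), v1.end());

    // Count elements of the second vector that also occur in the first.
    const auto common = std::count_if(v2.begin(), v2.end(),
        [&us](int n) { return us.find(n) != us.end(); });

    return v1.size() + v2.size() - static_cast<std::size_t>(common);
}
```

This only covers two vectors; for more than two, the intersection bookkeeping gets more involved, as the comment thread points out.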

There's no need to create the entire union vector. You can count the number of unique elements among the selected vectors by keeping a list of iterators and comparing/incrementing them appropriately.

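Here's a sketch:

#include <cstddef>
#include <limits>
#include <vector>

int countUnique(const std::vector<std::vector<unsigned int>>& selection)
{
  // One read position per selected vector.
  std::vector<std::vector<unsigned int>::const_iterator> iters;
  for (const auto& sel : selection) {
    iters.push_back(sel.begin());
  }
  // True once every iterator has reached the end of its vector.
  auto atEnd = [&]() -> bool {
    for (std::size_t i = 0; i < iters.size(); ++i) {
      if (iters[i] != selection[i].end()) return false;
    }
    return true;
  };
  int count = 0;
  while (!atEnd()) {
    // Find the minimum value among the iterators that are not at the end.
    unsigned int min = std::numeric_limits<unsigned int>::max();
    for (std::size_t i = 0; i < iters.size(); ++i) {
      if (iters[i] != selection[i].end() && *iters[i] < min) {
        min = *iters[i];
      }
    }

    // Advance every iterator that currently points at the minimum.
    for (std::size_t i = 0; i < iters.size(); ++i) {
      if (iters[i] != selection[i].end() && *iters[i] == min) {
        ++iters[i];
      }
    }

    ++count;
  }
  return count;
}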

This uses the fact that your input vectors are sorted and only contain unique elements.

The idea is to keep an iterator into each selected vector. The minimum value among those iterators is our next unique value in the union vector. Then we increment all iterators whose value is equal to that minimum. We repeat this until all iterators are at the end of the selected vectors.
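
A comment on the question suggests stopping as soon as a union can no longer beat the shortest one found so far. The following is a rough sketch of how that could be combined with the counting idea described here. The names `Vec`, `countUniqueBounded` and `shortestUnion`, and the representation of each stage as a list of indices into the collection, are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<unsigned int>;

// Counts distinct values across sorted, duplicate-free vectors, but stops
// as soon as the count reaches `limit` and returns `limit` in that case.
std::size_t countUniqueBounded(const std::vector<const Vec*>& selection,
                               std::size_t limit)
{
    std::vector<std::size_t> pos(selection.size(), 0);  // read position per vector
    std::size_t count = 0;
    while (count < limit) {
        // Find the smallest value among the current positions.
        bool found = false;
        unsigned int min = 0;
        for (std::size_t i = 0; i < selection.size(); ++i) {
            const Vec& v = *selection[i];
            if (pos[i] < v.size() && (!found || v[pos[i]] < min)) {
                min = v[pos[i]];
                found = true;
            }
        }
        if (!found) break;  // every vector is exhausted

        // Advance every vector whose current value equals the minimum.
        for (std::size_t i = 0; i < selection.size(); ++i) {
            const Vec& v = *selection[i];
            if (pos[i] < v.size() && v[pos[i]] == min) ++pos[i];
        }
        ++count;
    }
    return count;
}

// Length of the shortest union over all stages; each stage lists the
// indices of the vectors chosen from `collection`.
std::size_t shortestUnion(const std::vector<Vec>& collection,
                          const std::vector<std::vector<std::size_t>>& stages)
{
    std::size_t best = std::numeric_limits<std::size_t>::max();
    for (const auto& chosen : stages) {
        std::vector<const Vec*> selection;
        for (std::size_t idx : chosen) selection.push_back(&collection.at(idx));
        best = std::min(best, countUniqueBounded(selection, best));
    }
    return best;
}
```

Because a stage is abandoned once its count reaches the current best, later stages tend to do less work as `best` shrinks. Whether this is actually faster than `std::set_union` on the real data would need to be measured.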

D Drmmr