
I'm new to programming and I've recently run into a problem: finding the intersection of n vectors of ints, where each vector is sorted. The approach that I came up with has a complexity of O(n^2) and uses the std::set_intersection function.

The approach that I came up with uses two working vectors: the first holds my first input vector and the second holds the second. I call set_intersection on the two and write the result over the first vector, then clear the second with vector's clear function. I then copy the next input vector into the second, repeat the process, and eventually return the first vector.

I do believe there is a more efficient way of going about this, but at the moment I cannot think of one. Any help on this issue would be much appreciated.
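Here is roughly what my current approach boils down to (a simplified sketch rather than my exact code; the function name is just for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Fold std::set_intersection over the input one vector at a time,
// keeping the running intersection in `result`.
std::vector<int> intersect_all(const std::vector<std::vector<int>>& vecs) {
    if (vecs.empty()) return {};
    std::vector<int> result = vecs[0];          // running intersection
    for (std::size_t i = 1; i < vecs.size(); ++i) {
        std::vector<int> tmp;
        std::set_intersection(result.begin(), result.end(),
                              vecs[i].begin(), vecs[i].end(),
                              std::back_inserter(tmp));
        result.swap(tmp);                       // overwrite the running result
        if (result.empty()) break;              // the intersection can only shrink
    }
    return result;
}
```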

SnG
  • If elements don't repeat within one vector, or you can eliminate repetition, then the simplest thing is to count how many times each element appears using a map and output it if the number of appearances equals the number of vectors (see the sketch after these comments). – zch Mar 28 '15 at 16:41
  • How many numbers are in those vectors and what's the possible range of numbers? – Tesseract Mar 28 '15 at 16:54
  • There is no limit to the number of numbers in the vectors, and the range is from 0 to infinity. Each number inside a vector is unique ... – SnG Mar 28 '15 at 17:10
  • You could also try to sort all vectors, which is O(n log(n)). Then you can merge them in O(n), so the total complexity is O(n log(n)). – Tesseract Mar 28 '15 at 17:39
  • Doesn't std::set_intersection require the input to be sorted? At least that's what the documentation that I read said. – David K Mar 28 '15 at 17:43
  • You are right! I was mistaken; the numbers in the vectors are indeed sorted. – SnG Mar 28 '15 at 17:57
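A minimal sketch of the counting idea from zch's comment (assuming each value appears at most once per vector; the function name is just for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Count how many vectors each value appears in and keep the values that
// were seen in every vector. Assumes values are unique within one vector.
std::vector<int> intersect_by_counting(const std::vector<std::vector<int>>& vecs) {
    std::unordered_map<int, std::size_t> counts;
    for (const auto& v : vecs)
        for (int x : v)
            ++counts[x];                        // at most one increment per vector
    std::vector<int> result;
    for (const auto& kv : counts)
        if (kv.second == vecs.size())
            result.push_back(kv.first);
    std::sort(result.begin(), result.end());    // the hash map loses the sorted order
    return result;
}
```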

2 Answers


Fortunately, I think a much tighter bound can be placed on the complexity of your algorithm.

The complexity of std::set_intersection on input sets of size n1 and n2 is O(n1 + n2). You could take your original vectors and intersect them in single-elimination tournament style, that is, on the first round you intersect the 1st and 2nd vectors, the 3rd and 4th, the 5th and 6th, and so forth; on the second round you intersect the 1st and 2nd intersections, the 3rd and 4th, and so forth; repeat until the final round produces just one intersection. The sum of the sizes of all the vectors surviving each round is no more than half the sum of the sizes of the vectors at the start of the round, so this algorithm takes O(N) time (also O(N) space) altogether where N is the sum of the sizes of all the original vectors in your input. (It's O(N) because N + N/2 + N/4 + ... < 2N.)

So, given an input consisting of already-sorted vectors, the complexity of the algorithm is O(N).

Your algorithm merges the vectors in a very different sequence, but while I'm not 100% sure it is also O(N), I strongly suspect that it is.


Edit: Concerning how to actually implement the "tournament" algorithm in C++, it depends on how hard you want to work to optimize this, and somewhat on the nature of your input.

The easiest approach would be to make a new list of vectors: take two vectors from the old list, push an empty vector onto the new list, write the intersection of the two old vectors into the new vector, destroy the old vectors, and hope the library manages the memory efficiently.
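A minimal sketch of that simplest variant (untested, with illustrative names), using std::vector<std::vector<int>> as the working container and rebuilding it round by round:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Tournament rounds: pair up vectors, intersect each pair into a fresh
// vector, carry an odd leftover forward unchanged, repeat until one remains.
std::vector<int> tournament_intersect(std::vector<std::vector<int>> vecs) {
    if (vecs.empty()) return {};
    while (vecs.size() > 1) {
        std::vector<std::vector<int>> next;
        next.reserve((vecs.size() + 1) / 2);
        for (std::size_t i = 0; i + 1 < vecs.size(); i += 2) {
            std::vector<int> out;
            std::set_intersection(vecs[i].begin(), vecs[i].end(),
                                  vecs[i + 1].begin(), vecs[i + 1].end(),
                                  std::back_inserter(out));
            next.push_back(std::move(out));
        }
        if (vecs.size() % 2 != 0)               // odd one out advances unchanged
            next.push_back(std::move(vecs.back()));
        vecs = std::move(next);
    }
    return std::move(vecs.front());
}
```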

If you want to reduce the allocation of new vectors, then re-using vectors (as you already thought to do) might help. If the input data structure is an std::list<std::vector<int> >, for example, one pass could work like this:

1. Push one empty vector onto the front of the list.
2. Make three iterators: one to the new vector, and one to each of the original first two vectors in the list.
3. Take the intersection of the vectors at the last two iterators, writing the result through the first iterator, then clear the vectors at the last two iterators.
4. Move the last two iterators forward two places each, move the first iterator forward one place, and repeat from step 3.
5. If you reach a state where one of the last two iterators has reached end() but the other has not, erase all the list elements between the first iterator and the other iterator.

Now you have a list of vectors again and can repeat the pass as long as there is more than one vector in the list.
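A rough sketch of that list-based pass (untested; as a simplification it erases the leftover cleared vectors at the end of each pass rather than in the middle, and swaps an odd leftover vector into the next output slot):

```cpp
#include <algorithm>
#include <iterator>
#include <list>
#include <vector>

// In-place passes over a std::list, reusing cleared vectors as output slots.
std::vector<int> intersect_list(std::list<std::vector<int>> vecs) {
    if (vecs.empty()) return {};
    while (vecs.size() > 1) {
        vecs.emplace_front();                    // output slot for the first pair
        auto out = vecs.begin();                 // where the next result goes
        auto a   = std::next(out);               // first input of the current pair
        while (a != vecs.end() && std::next(a) != vecs.end()) {
            auto b = std::next(a);
            std::set_intersection(a->begin(), a->end(),
                                  b->begin(), b->end(),
                                  std::back_inserter(*out));
            a->clear();
            b->clear();
            ++out;                               // reuse a cleared vector as the next slot
            std::advance(a, 2);
        }
        if (a != vecs.end()) {                   // odd vector left over: carry it forward
            out->swap(*a);
            ++out;
        }
        vecs.erase(out, vecs.end());             // drop the now-empty tail
    }
    return std::move(vecs.front());
}
```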

If the input is std::vector<std::vector<int> > then pushing an element onto the front is relatively expensive, so you might want a slightly more complicated algorithm. There are lots of choices, and no really obvious winner that I can think of.

David K
  • I thought it was wrong, but it is in fact correct. Every `std::set_intersection` removes a number of elements proportional to its running time, so the whole algorithm runs in linear time with respect to the total input size. – zch Mar 28 '15 at 18:37
  • I'm not sure I agree with your analysis. You seem to be making a lot of implicit assumptions about the distribution of sizes of the vectors. – Nicu Stiurca Mar 28 '15 at 18:37
  • @David K, when I was looking at the set_intersection function on the cpp website, it seems that you have to resize the vector once you have taken the intersection of two vectors. Is there a way to get around using the resize function, since vector resize is linear? – SnG Mar 28 '15 at 18:50
  • @sg123456 - Linear-time erase would be fine for your application; it costs no more than the paired `set_intersection`. But `resize` to a smaller vector of a primitive type (no destructor) would probably be `O(1)` in practical implementations anyway. – zch Mar 28 '15 at 19:00
  • @SchighSchagh Indeed for the OP's algorithm I do not have an analysis independent of the distribution of vector sizes; that's why I only "strongly suspect" it is O(N). (I tried to find a distribution that exceeded O(N), but maybe I never hit the right one.) For the tournament style, however, I make _no_ assumption about the vector sizes. The size of the intersection of two sets can never be more than half the sum of the sizes of those sets. To be thorough, we should consider what happens with an odd number of vectors, but it turns out that actually reduces the running time. – David K Mar 28 '15 at 19:15
  • @DavidK - I'm trying to figure out a way to implement the algorithm you suggested, but I'm having trouble coming up with the code that would be needed. What is the general structure one uses to implement a single-elimination-style algorithm? – SnG Mar 28 '15 at 19:29
  • The choice of implementation depends on various things mentioned in the paragraphs I just added to the answer. A highly optimized but complicated algorithm may not be desirable; it depends on how much of your processing time is actually spent in this algorithm and how hard you're willing to work to reduce it by a few percent. – David K Mar 28 '15 at 20:10

Here is another analysis that shows that your algorithm is already linear.

Suppose you have some collection of vectors, and the algorithm repeatedly selects two vectors from the collection and replaces them with their intersection, until only one vector is left. Your method fits this description. I argue that any such algorithm spends, in total, linear time across all executions of set_intersection.

Suppose set_intersection takes at most A * (x + y) operations for vectors of sizes x and y.

Let K be the sum of the lengths of all vectors in the collection. It starts at the total size of the input (n) and it cannot fall below zero, so in total it can decrease by at most n.

Every time vectors of sizes x and y are combined, the value of K decreases by at least (x + y)/2, because the result can have at most min(x, y) <= (x + y)/2 elements. If we sum this over all calls we get sum { (x + y)/2 } <= n, since K cannot decrease by more than n in total.

From this we can conclude that sum { A * (x + y) } <= 2 * A * n = O(n). The left-hand side is the total time spent in set_intersection.
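Written out as one chain of inequalities with the same symbols as above (this is just a restatement of the argument):

```latex
\sum_{\text{calls}} A\,(x + y)
  \;=\; 2A \sum_{\text{calls}} \frac{x + y}{2}
  \;\le\; 2A \sum_{\text{calls}} \bigl(\text{decrease of } K\bigr)
  \;\le\; 2A \cdot n \;=\; O(n)
```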

In less formal language: to spend x + y time in set_intersection you need to remove at least (x + y)/2 elements from your collection, so spending more than linear time in set_intersection would make you run out of elements.

zch