How to get a sorted subvector out of a sorted vector, fast

Question

I have a data structure like this:

struct X {
  float value;
  int id;
};

and a vector of those, of size N (think 100000), sorted by value; it stays constant during the execution of the program:

std::vector<X> values;

Now, I want to write a function

void subvector(std::vector<X> const& values, 
               std::vector<int> const& ids, 
               std::vector<X>& out /*, 
               helper data here */);

that fills the out parameter with a sorted subset of values, given by the passed ids (size M < N, about 0.8 times N), and does so fast. Memory is not an issue, and this will be done repeatedly, so building lookup tables (the helper data from the function parameters) or doing anything else that happens only once is entirely OK.

My solution so far:
Build a lookup table lut containing id -> offset in values (one-time preparation, so it does not count towards the per-call runtime)
Create std::vector<X> tmp, size N, filled with invalid ids (linear in N)
For each id, copy values[lut[id]] to tmp[lut[id]] (linear in M)
Loop over tmp, copying the valid items to out (linear in N)

This is linear in N (as N is bigger than M), but the temporary variable and the repeated copying bug me. Is there a way to do it quicker than this? Note that M will be close to N, so things that are O(M log N) are unfavourable.
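For reference, the steps described above could be sketched roughly as follows. This is a minimal illustration and not the linked sample code; the helper name build_lut, the extra lut parameter and the use of -1 as the invalid id are assumptions made for the sketch, and ids are assumed to be unique and to lie in [0, N).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct X {
  float value;
  int id;
};

// One-time preparation: lut[id] is the offset of that id in `values`.
// Assumes ids are unique and lie in [0, values.size()).
std::vector<std::size_t> build_lut(std::vector<X> const& values) {
  std::vector<std::size_t> lut(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    assert(values[i].id >= 0 &&
           static_cast<std::size_t>(values[i].id) < values.size());
    lut[values[i].id] = i;
  }
  return lut;
}

void subvector(std::vector<X> const& values,
               std::vector<int> const& ids,
               std::vector<X>& out,
               std::vector<std::size_t> const& lut) {
  assert(lut.size() == values.size());

  const int invalid_id = -1;  // assumption: -1 never occurs as a real id
  X hole = {0.0f, invalid_id};
  std::vector<X> tmp(values.size(), hole);  // linear in N

  // Scatter the selected elements to their original offsets (linear in M).
  for (int id : ids) {
    std::size_t pos = lut[id];
    tmp[pos] = values[pos];
  }

  // Compact: skip the holes, keeping the original order (linear in N).
  out.clear();
  out.reserve(ids.size());
  for (X const& x : tmp) {
    if (x.id != invalid_id) out.push_back(x);
  }
}
```

The scatter step keeps every selected element at the offset it has in values, which is why the compacted result is still sorted by value.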

Edit: http://ideone.com/xR8Vp is a sample implementation of the algorithm described above, to make the desired output clear and to show that it's doable in linear time. The question is whether the temporary variable can be avoided or the whole thing sped up in some other way; anything that is not linear is not faster :).

And what is the purpose of that `tmp`? Where did it come from in the first place? Why aren't you building your output directly in `out` without any intermediate temporaries? — AnT stands with Russia, Nov 29 '10 at 23:05
Also, what you are trying to build is not well-described in your question. Initially, you seem to say that you need output of size `M`. Yet your algorithm attempts to build output of size `N` in all cases. So, what is it you are trying to get in `out` array after all is done? — AnT stands with Russia, Nov 29 '10 at 23:08
regarding "where does tmp come from" - i created it. regarding "why am i not building it in `out` directly" - i don't know where to place the element beforehand, i don't know the position in the subvector. and no, my output is size `M`, it's only linear in N because i test each element in tmp. and yes, the `id` values are unique. — etarion, Nov 29 '10 at 23:17
A second vector sorted by `id` and using `equal_range`, `copy` and finally `sort` by value should give you `M log N` complexity. — clstrfsck, Nov 29 '10 at 23:19
That's one thing i forgot to mention - M will be pretty close to N, so this will be unfavourable (for very sparse ids, it would be favourable) — etarion, Nov 29 '10 at 23:22
regarding `vector`: In theory, no. In practice, i'm looping 2 passes over `out` later, so anything that makes iteration slower might be unfavourable. I've thought about using a sorted container, but insertion into those is logarithmic and results in O(M log M) or similar complexity, and using a std::list prevents the random access that i need to use the LUT. That's why i didn't use them - if you know a better way, i'll be happy to hear it. — etarion, Nov 29 '10 at 23:42
spong: why do you think it's not sorted? it was sorted before, and the index is identical in the copy to tmp, so it stays sorted (with holes) there, and the second pass just removes the holes. — etarion, Nov 29 '10 at 23:44
Yes indeed, my mistake. Need to improve comprehension skills :( — clstrfsck, Nov 29 '10 at 23:50
@etarion is the range of ids O(N)? Or are the ids arbitrary? — nbourbaki, Nov 30 '10 at 03:43
the range of the ids is [0, N). for every 0 <= id < N there is exactly one element in `values`. — etarion, Nov 30 '10 at 12:49
Okay, the light just went on. Apologies for the confusion. Seems like you could use a "sparse array" container to cut down on the size of your temporary array, but I don't see a way around it outside of sorting. — Mark Storer, Nov 30 '10 at 16:36

score 2 · Accepted Answer · answered Nov 30 '10 at 01:39

An alternative approach you could try is to use a hash table instead of a vector to look up ids in:

void subvector(std::vector<X> const& values, 
               std::unordered_set<int> const& ids, 
               std::vector<X>& out) {

    out.clear();
    out.reserve(ids.size());
    for(std::vector<X>::const_iterator i = values.begin(); i != values.end(); ++i) {
        if(ids.find(i->id) != ids.end()) {
            out.push_back(*i);
        }
    }
}

This runs in time linear in N, since unordered_set::find takes expected constant time (assuming that we have no problems hashing ints). However, I suspect it might not be as fast in practice as the approach you described initially using vectors.
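A variant of the same single-pass idea replaces the hash set with a flag vector indexed by id. It relies on the ids being unique and lying in [0, N), as stated in the comments on the question. The function name subvector_flags and the reusable scratch buffer selected are hypothetical names for this sketch, and it has not been benchmarked against the other versions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct X {
  float value;
  int id;
};

// `selected` is a scratch buffer of size N that must be all zero on entry.
// It is all zero again on return, so it can be reused across calls.
// Assumes ids are unique and lie in [0, values.size()).
void subvector_flags(std::vector<X> const& values,
                     std::vector<int> const& ids,
                     std::vector<X>& out,
                     std::vector<char>& selected) {
  assert(selected.size() == values.size());

  // Mark the requested ids (linear in M).
  for (int id : ids) selected[id] = 1;

  // Single pass over the sorted input (linear in N): copy marked elements
  // in order and clear each flag as it is consumed.
  out.clear();
  out.reserve(ids.size());
  for (X const& x : values) {
    if (selected[x.id]) {
      out.push_back(x);
      selected[x.id] = 0;
    }
  }
}
```

Only one byte per element is written as temporary data, and each selected X is copied once, directly into out.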

Thanks, this looks interesting. Will benchmark against the vector version. — etarion, Nov 30 '10 at 21:09

score 1 · Answer 2 · answered Nov 29 '10 at 23:01

Since your vector is sorted, and you want a subset of it sorted the same way, I assume we can just slice out the chunk you want without rearranging it.

Why not just use find_if() twice: once to find the start of the range you want and once to find the end of the range? This will give you the start and end iterators of the subvector. Construct a new vector using those iterators; one of the vector constructor overloads takes two iterators.

That or the partition algorithm should work.

answered Nov 29 '10 at 23:01 by Jay

Not sure this will work. If I read the question correctly, the OP has the array sorted by `value` and wants to select by `id`. – clstrfsck Nov 29 '10 at 23:10
yes, and the ids are not contiguous (and not necessarily sorted). – etarion Nov 29 '10 at 23:23

George · Answer 3 · 2010-11-30T11:13:09.157

If I understood your problem correctly, you are actually trying to create a sorting algorithm that runs in time linear in M, the number of input values. That is NOT possible.

Your current approach is to keep a sorted list of all possible values. This takes time linear in the number of possible values N (theoretically, given that the map lookup takes O(1) time).

The best you can do is to sort the values you found through the map with a fast sorting method (O(M log M), e.g. quicksort or mergesort) for small values of M, and maybe do the linear pass for bigger values of M. For example, if N is 100000 and M is 100, it is much faster to just use a sorting algorithm.

I hope you can understand what I say. If you still have questions I will try to answer them :)

edit: (comment) I will further explain what I mean. Say you know that your numbers will range from 1 to 100. You have them sorted somewhere (actually they are "naturally" sorted) and you want to get a subset of them in sorted form. If it were possible to do this faster than O(N) or O(M log M), sorting algorithms would just use this method to sort.

E.g. given the set of numbers {5,10,3,8,9,1,7}, and knowing that they are a subset of the sorted set of numbers {1,2,3,4,5,6,7,8,9,10}, you still can't sort them faster than O(N) (N = 10) or O(M log M) (M = 7).
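The sort-based alternative for small M might look something like the sketch below. It assumes a lookup table lut from id to offset in values, as in the question. Sorting the offsets instead of the elements is a choice made for this sketch, so that elements with equal value keep the order they have in values. The name subvector_sorted is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct X {
  float value;
  int id;
};

// O(M log M) variant, worthwhile only when M is much smaller than N.
// Assumes lut[id] is the offset of `id` in `values`.
void subvector_sorted(std::vector<X> const& values,
                      std::vector<int> const& ids,
                      std::vector<X>& out,
                      std::vector<std::size_t> const& lut) {
  assert(lut.size() == values.size());

  // Translate ids to offsets in the sorted vector (linear in M).
  std::vector<std::size_t> offsets;
  offsets.reserve(ids.size());
  for (int id : ids) offsets.push_back(lut[id]);

  // Ascending offsets correspond to ascending values (O(M log M)).
  std::sort(offsets.begin(), offsets.end());

  // Gather the elements in that order (linear in M).
  out.clear();
  out.reserve(offsets.size());
  for (std::size_t pos : offsets) out.push_back(values[pos]);
}
```

Nothing in this version touches all N elements per call, which is where it can win when the id list is very sparse.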

No, i don't want to create a linear sorting time algorithm - i want to get values from an already sorted vector, so no sorting needs to be done. see http://ideone.com/SNHVq for a sample implementation of the algorithm i outlined in the OP. — etarion, Nov 30 '10 at 00:03
