
I have a simple std::vector containing some numbers, sorted in ascending order. I want to look up an element; so far I use:

return std::lower_bound(vec.begin(), vec.end(), needle);

Here needle is the element I am looking for. However, my vector tends to be quite long (millions of elements), and most of the time its contents are predictable, in the sense that if the first element is zero and the last element is N, then the element at a given index has a value close to (N * index) / vec.size().

Is there a modification of the lower bound, which would accept a hint (similarly to how std::map::emplace_hint() does), such as:

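assert(!vec.empty() && vec.back() != 0); // vec.back() is the divisor below
// interpolate the expected position, clamped to the last valid index
std::vector<int>::iterator hint = vec.begin() + std::min(vec.size() - 1,
    (needle * vec.size()) / vec.back());
if(*hint >= needle) // >= so that duplicates of needle before hint are still found
    return std::lower_bound(vec.begin(), hint, needle);
else
    return std::lower_bound(hint, vec.end(), needle);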

This works, but lower_bound ignores the fact that the hint is close to the solution and will most likely start by halving the interval (looking where we know the needle most likely isn't), taking unnecessarily many steps. I know there is an algorithm that starts with a step of 1, doubles it until it overshoots the needle, and then does a binary search in the resulting interval.

I forgot the name of that algorithm. Is it implemented in the STL?

the swine
  • If your container is sorted, why don't you use std::set or std::multiset? It will use a much better searching algorithm than std::lower_bound() – ChrisWard1000 Oct 28 '14 at 16:12
  • @ChrisWard1000 vector will have better cache performance. – Neil Kirk Oct 28 '14 at 16:13
  • @ChrisWard1000 The reason can be you want a cheap insert. – BartoszKP Oct 28 '14 at 16:13
  • @ChrisWard1000 because insertion to `set` is amortized, and the preceding algorithm generates the elements already sorted, so for insertion `vector` is IMO significantly cheaper. – the swine Oct 28 '14 at 16:13
  • @ChrisWard1000 why do you think `std::set` uses a better algorithm than `std::lower_bound`? They're both O(log n) and the `vector` typically has better performance. – Mark Ransom Oct 28 '14 at 16:25
  • I think you almost hint at a solution yourself; in the code you gave, adjust both the begin and end iterators to be symmetric about hint; if that search succeeds, you're done; otherwise resort to "regular" lower_bound; this way, if the hint is near the ends of the vector, you will search a very narrow range, and if the hint is right in the middle, you'd just search the complete range. – Stefan Atev Oct 28 '14 at 17:53
  • @StefanAtev well, yes, I could do that, but these algorithms are [extremely tricky to write](http://googleresearch.blogspot.cz/2006/06/extra-extra-read-all-about-it-nearly.html), I'd rather use a standard implementation, or at least an algorithm with proven complexity bounds. – the swine Oct 29 '14 at 09:25
  • I think name of the algorithm is ["one-sided binary search"](http://en.wikipedia.org/wiki/Binary_search_algorithm#Exponential_search). It is not in standard C++ library. – Evgeny Kluev Oct 29 '14 at 11:02
  • @EvgenyKluev yes, that is indeed correct, that is the algorithm that I was looking for. Are there any open-source implementations of it? – the swine Oct 29 '14 at 11:10
  • I've never seen one. Probably it is easier to implement from scratch than to find an implementation... – Evgeny Kluev Oct 29 '14 at 11:23
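
The exponential ("one-sided binary") search named in the comments can be sketched as below. This is only an illustration, not a standard or library implementation: `lower_bound_hint` and `find_near_guess` are hypothetical names, and the sketch assumes random-access iterators, a range sorted with `operator<`, and a hint inside `[first, last]`.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>
#include <cassert>

// Hypothetical helper: returns the same iterator as
// std::lower_bound(first, last, value), but starts at `hint` and gallops
// away from it with doubling steps before finishing with a binary search.
template <class RandomIt, class T>
RandomIt lower_bound_hint(RandomIt first, RandomIt last, RandomIt hint, const T& value)
{
    typedef typename std::iterator_traits<RandomIt>::difference_type diff_t;
    assert(first <= hint && hint <= last); // assumption: hint lies in the range
    if (first == last)
        return last;
    if (hint == last)
        --hint; // make the hint dereferenceable
    diff_t step = 1;
    if (*hint < value) {
        // Gallop to the right. Invariant: *lo < value.
        RandomIt lo = hint;
        while (last - lo > step && *(lo + step) < value) {
            lo += step;
            step *= 2;
        }
        // Either lo + step is the first probe that is not less than value,
        // or the probe would fall outside the range.
        RandomIt hi = (last - lo > step) ? lo + step : last;
        return std::lower_bound(lo + 1, hi, value);
    }
    // Gallop to the left. Invariant: *hi is not less than value.
    RandomIt hi = hint;
    while (hi - first >= step && !(*(hi - step) < value)) {
        hi -= step;
        step *= 2;
    }
    // Either hi - step is the first probe that is less than value,
    // or the probe would fall before the range.
    RandomIt lo = (hi - first >= step) ? hi - step + 1 : first;
    return std::lower_bound(lo, hi, value);
}

// Hypothetical wrapper using the interpolated guess from the question.
// Falls back to a hint at the beginning when the guess cannot be computed.
std::vector<int>::const_iterator find_near_guess(const std::vector<int>& vec, int needle)
{
    if (vec.empty() || vec.back() <= 0 || needle <= 0)
        return lower_bound_hint(vec.begin(), vec.end(), vec.begin(), needle);
    const unsigned long long guess =
        static_cast<unsigned long long>(needle) * vec.size() /
        static_cast<unsigned long long>(vec.back());
    const std::size_t index = static_cast<std::size_t>(
        std::min<unsigned long long>(guess, vec.size() - 1));
    return lower_bound_hint(vec.begin(), vec.end(), vec.begin() + index, needle);
}
```

If the answer lies d positions away from the hint, the galloping phase takes about log2(d) probes and the final binary search works on an interval of roughly d elements, so a good hint costs O(log d) comparisons instead of O(log n).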

1 Answer


I think the algorithm you're looking for is called interpolation search, a variation on binary search that, instead of probing the midpoint of the array, linearly interpolates between the values at the array endpoints to guess where the key should be. On data that's roughly uniformly distributed, the way yours is, the expected runtime is O(log log n), exponentially faster than a standard binary search.

There is no standard implementation of this algorithm in C++, but (as a totally shameless plug) I happen to have coded this one up in C++. My implementation is available online if you're interested in seeing how it works.
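
Independently of the implementation mentioned above, here is a minimal sketch of the idea for a sorted `std::vector<int>`. The name `interpolation_lower_bound` is hypothetical, and the sketch assumes the vector is small enough that the intermediate product fits in a `long long` (true for millions of `int` elements).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: index of the first element >= needle in the sorted
// vector v (v.size() if there is none), i.e. the position std::lower_bound
// would find, located by interpolating between the endpoint values.
std::size_t interpolation_lower_bound(const std::vector<int>& v, int needle)
{
    // Invariant: elements before lo are < needle, elements from hi on are >= needle.
    std::size_t lo = 0, hi = v.size();
    while (lo < hi) {
        const long long a = v[lo], b = v[hi - 1];
        if (needle <= a)
            return lo; // v[lo] is already the first element >= needle
        if (needle > b)
            return hi; // everything in [lo, hi) is < needle
        // Here a < needle <= b, so b - a > 0 and the division is safe.
        const std::size_t mid = lo + static_cast<std::size_t>(
            (needle - a) * static_cast<long long>(hi - 1 - lo) / (b - a));
        assert(lo <= mid && mid < hi);
        if (v[mid] < needle)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

Every iteration either returns or shrinks `[lo, hi)` by at least one element, so the loop always terminates. On skewed data it may shrink by only one element per step, which is where a linear worst case comes from.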

Hope this helps!

templatetypedef
  • I see. This is not the algorithm I meant, I did not know it before (great answer!). I wanted to use only one step of the interpolation search and then continue with some kind of binary that I forgot the name of. How does this algorithm behave in case the array is not "linearly interpolated" (what is the worst case)? Also I can see why this is not in STL, as this algorithm can only work with numbers, it would be difficult (if not impossible) to make it work with generic types for which comparison is defined (such as strings). Am I right? – the swine Oct 28 '14 at 16:22
  • @theswine In the worst case, the runtime of this algorithm will be O(n). That happens only if the data are exponentially increasing, which isn't at all likely to happen in practice. I think you're right that the reason this was left out is that it's difficult to make this work with non-numeric types, though you could imagine requiring some sort of client-specified interpolation function as a final parameter. – templatetypedef Oct 28 '14 at 16:25
  • @theswine No worries! If you had found a bug, I definitely would have wanted to know about it. It's pretty cool that you're putting this on a GPU! – templatetypedef Nov 17 '14 at 18:33
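
The client-specified interpolation function suggested in the comments could look roughly like the sketch below. Nothing here is an existing API: `interpolation_lower_bound_with` and `first_char_fraction` are hypothetical names. The caller's function is assumed to return a guess in [0, 1] of where the needle lies between the two endpoint values. A poor guess only costs speed, not correctness.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical generic variant: fraction(low, high, needle) is supplied by
// the caller and estimates the relative position of needle between low and high.
template <class T, class Fraction>
std::size_t interpolation_lower_bound_with(const std::vector<T>& v, const T& needle,
                                           Fraction fraction)
{
    std::size_t lo = 0, hi = v.size();
    while (lo < hi) {
        if (!(v[lo] < needle))
            return lo; // v[lo] is the first element not less than needle
        if (v[hi - 1] < needle)
            return hi; // everything in [lo, hi) is less than needle
        double f = fraction(v[lo], v[hi - 1], needle);
        f = std::min(1.0, std::max(0.0, f)); // clamp a misbehaving estimate
        std::size_t mid = lo + static_cast<std::size_t>(f * static_cast<double>(hi - 1 - lo));
        mid = std::min(mid, hi - 1); // guard against floating-point rounding
        if (v[mid] < needle)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

// Example estimator for std::string: interpolate on the first character only.
double first_char_fraction(const std::string& low, const std::string& high,
                           const std::string& needle)
{
    const double a = low.empty() ? 0.0 : static_cast<unsigned char>(low[0]);
    const double b = high.empty() ? 0.0 : static_cast<unsigned char>(high[0]);
    const double x = needle.empty() ? 0.0 : static_cast<unsigned char>(needle[0]);
    return b > a ? (x - a) / (b - a) : 0.5; // fall back to the midpoint
}
```

A call such as `interpolation_lower_bound_with(words, key, first_char_fraction)` then returns the same index that `std::lower_bound` would find on a sorted `std::vector<std::string>`.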