70

Often, it is more efficient to use a sorted std::vector instead of a std::set. Does anyone know a library class sorted_vector, which basically has a similar interface to std::set, but inserts elements into the sorted vector (so that there are no duplicates), uses binary search to find elements, etc.?

I know it's not hard to write, but it's probably better not to waste time and to use an existing implementation instead.

Update: The reason to use a sorted vector instead of a set is: If you have hundreds of thousands of little sets that contain only 10 or so members each, it is more memory-efficient to just use sorted vectors instead.

Frank
  • Could you maybe be more specific about what in std::set isn't efficient enough? – KillianDS Apr 25 '10 at 22:28
  • If you have hundreds of thousands of little sets that contain only 10 or so members each, it is more memory-efficient to just use sorted vectors instead. – Frank Apr 25 '10 at 22:31
  • 2
    I don't think there's a ready-made class for that. You may write your own or use `lower_bound()` for insertion and `binary_search()` for lookup. –  Apr 25 '10 at 22:36
  • 6
    If the vectors are so small, the difference between binary and sequential search is likely to be small too, so you may as well just use a std::vector. –  Apr 25 '10 at 22:45
  • 4
    The difference will probably be quite large because of the cache misses that the set will incur. – Neil G Apr 25 '10 at 23:14
  • I am writing this container. I should have it done in a week (with StackOverflow's help! :) Where is the best place to share this code? – Neil G May 12 '10 at 21:54
  • @Neil G: Maybe upload to google code or to github and post the link right here? – Frank Sep 27 '10 at 13:54
  • @Frank: It's here: http://stackoverflow.com/questions/3125905/sparse-vector-template-class-how-do-i-clean-it-up . Please let me know if you make improvements. – Neil G Sep 28 '10 at 02:32
  • 2
    @Frank: I'm a bit late to this question, but anyway :) You should check if binary search in a sorted vector of "10 or so" elements is any faster than just a linear search. It is quite possible that it isn't faster, or it could even be slower, as processor's branch prediction will play an important role in this case. – Roman L Feb 01 '11 at 21:37
  • 1
    Related paper by Matt Austern: [Why You Shouldn't Use set, and What You Should Use Instead](http://lafstern.org/matt/col1.pdf). – legends2k Nov 25 '14 at 16:07
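
The `lower_bound()` / `binary_search()` approach suggested in the comments above could look roughly like this. It is only a minimal sketch, not a library class: the helper names `sorted_insert` and `sorted_contains` are made up for illustration, and it assumes the element type supports `operator<`.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: insert value into an already-sorted vector, keeping it
// sorted and free of duplicates. Returns true if the value was inserted.
template <typename T>
bool sorted_insert(std::vector<T>& v, const T& value)
{
    // Binary search for the first element that is not less than value.
    auto pos = std::lower_bound(v.begin(), v.end(), value);
    if (pos != v.end() && !(value < *pos))
        return false;       // an equivalent element is already present
    v.insert(pos, value);   // shifts the tail, so this step is O(n)
    return true;
}

// Hypothetical helper: O(log n) membership test on a sorted vector.
template <typename T>
bool sorted_contains(const std::vector<T>& v, const T& value)
{
    return std::binary_search(v.begin(), v.end(), value);
}
```

For vectors of 10 or so elements, as in the question, it would be worth measuring this against a plain `std::find` before settling on either.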

6 Answers

34

Boost.Container flat_set

Boost.Container flat_[multi]map/set containers are ordered-vector based associative containers based on Austern's and Alexandrescu's guidelines. These ordered vector containers have also benefited recently with the addition of move semantics to C++, speeding up insertion and erasure times considerably. Flat associative containers have the following attributes:

  • Faster lookup than standard associative containers
  • Much faster iteration than standard associative containers.
  • Less memory consumption for small objects (and for big objects if shrink_to_fit is used)
  • Improved cache performance (data is stored in contiguous memory)
  • Non-stable iterators (iterators are invalidated when inserting and erasing elements)
  • Non-copyable and non-movable value types can't be stored
  • Weaker exception safety than standard associative containers (copy/move constructors can throw when shifting values in erasures and insertions)
  • Slower insertion and erasure than standard associative containers (especially for non-movable types)

Live demo:

#include <boost/container/flat_set.hpp>
#include <iostream>
#include <ostream>

using namespace std;

int main()
{
    boost::container::flat_set<int> s;
    s.insert(1);
    s.insert(2);
    s.insert(3);
    cout << (s.find(1)!=s.end()) << endl;
    cout << (s.find(4)!=s.end()) << endl;
}

jalf: If you want a sorted vector, it is likely better to insert all the elements, and then call std::sort() once, after the insertions.

boost::flat_set can do that automatically:

template<typename InputIterator> 
flat_set(InputIterator first, InputIterator last, 
         const Compare & comp = Compare(), 
         const allocator_type & a = allocator_type());

Effects: Constructs an empty set using the specified comparison object and allocator, and inserts elements from the range [first, last).

Complexity: Linear in N if the range [first, last) is already sorted using comp and otherwise N*log(N), where N is last - first.

Evgeny Panasyuk
10

The reason such a container is not part of the standard library is that it would be inefficient. Using a vector for storage means objects have to be moved if something is inserted in the middle of the vector. Doing this on every insertion gets needlessly expensive. (On average, half the objects will have to be moved for each insertion. That's pretty costly.)

If you want a sorted vector, it is likely better to insert all the elements, and then call std::sort() once, after the insertions.
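
A rough sketch of that "fill first, sort once" idea, using only the standard library. The helper name `sort_and_dedupe` is hypothetical, and the `std::unique` step is an addition that assumes you want set-like behaviour without duplicates, as in the question:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: turn an arbitrarily ordered vector into a sorted,
// duplicate-free one with a single O(n log n) sort.
template <typename T>
void sort_and_dedupe(std::vector<T>& v)
{
    std::sort(v.begin(), v.end());
    // std::unique moves the distinct elements to the front and returns the
    // new logical end; erase drops the leftover tail.
    v.erase(std::unique(v.begin(), v.end()), v.end());
}
```

After that, lookups can use `std::binary_search` or `std::lower_bound` on the vector directly.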

jalf
  • I don't see how that would solve the problem. All the objects still have to be touched, even if it is only a pointer swap. You're still trying to do something that the data structure just isn't suited for. – jalf Apr 26 '10 at 01:41
  • 9
    I started writing an answer like that, and stopped because it's simply not really true. For less than a few dozen elements, which is pretty common really, moving on average half can easily be less expensive than performing an allocation and a tree rebalance. Of course it's better to call `sort` once, and I personally wouldn't look for a container to do this, but it's a matter of style. – Potatoswatter Apr 26 '10 at 03:06
  • 2
    Inserting n elements into a sorted array one at a time costs log n to find each insertion point plus about n/2 element moves per insertion, so O(n*n) in total, not efficient at all. Might work out if n is small enough though. – Mark Ransom Apr 26 '10 at 03:33
  • @Potatoswatter: Replacing it with a node-based datastructure wasn't my suggested alternative though. Like you say, the heap allocations and tree rebalancing gets pricey too (although a custom allocator could help somewhat). Sorting once, at the end, was my suggestion. – jalf Apr 26 '10 at 19:30
  • I suggest a combination of std::map and std::vector as a solution. – Rampal Chaudhary Jan 17 '14 at 05:32
  • Downvote: [Matt Austern](http://lafstern.org/matt/col1.pdf) gives a very clear use case for this, and is also clearer on why it is not part of STL (the expensive insert/delete). Stating that such an implementation is inefficient is not correct. Only a few operations are more inefficient versus associative containers. Others perform better. And as he mentions, the expensive ones often don't matter. Downvoted because the answer doesn't really address the question, nor distinguishes the efficiency by operation, nor mention boost's perfectly reasonable implementation. – Cookie May 17 '14 at 11:50
  • @Cookie did you by any chance read the part where I suggest you can sort the vector *when you need it to be sorted*, thus avoiding the needless overhead of sorting between *every* insertion, *while keeping all the advantages of a sorted vector*? When I say that a `sorted_vector` data structure would be inefficient, I mean in comparison to a regular vector that is sorted on an as-needed basis. The latter has all the advantages of a sorted vector, without paying a needless cost on insertion. Thus the former is inefficient. – jalf May 19 '14 at 10:50
  • Here, another downvote. There's definitely a case for a sorted vector behaving as a set or an associative container in terms of its interface, that has peculiar performance characteristics (very expensive insert/update if you want), and this is demonstrated by the very existence of boost::flat_map and alike. This answer adds no value. – gd1 Nov 22 '16 at 19:37
  • inefficient in 2017 – Unicorn Jun 28 '17 at 18:15
  • This answer ignores the obvious fact that abstract data types exist so that one can hide implementation details, and "do the right thing" which, in the case this answer posits, is to simply append the inserted value to the end of the vector until such time as it is read by a method that requires it be sorted. For instance, a method that asks for the length of the vector does not require that the vector be sorted just yet. However, taking the median value of, say, a vector of numbers would require that the vector be in a sorted state and, if it is not, sort it at that time. – user3673 Dec 03 '17 at 18:47
  • This depends strongly on how it's going to be used. If you need to do inserts sporadically and rarely, but you require very fast reading, a vector can be faster even for large lists than, say, a binary tree set would be. – Jeroen May 17 '18 at 18:57
  • Of course insertion cost may not matter as much if you are searching many orders of magnitude more than you are inserting. In which case vector post-sorted would still work, but maybe the binary search would be less syntactically sugared. I admit I haven't thought about it too deeply. – Bill - K5WL Jun 21 '18 at 20:41
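
The "append now, sort when a read needs it" idea discussed in these comments might be sketched like this. The class name `lazy_sorted_vector` and its members are hypothetical, and unlike a set this sketch keeps duplicates:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical class: appends on insert and sorts lazily, only when an
// operation actually needs the elements to be in order.
template <typename T>
class lazy_sorted_vector
{
public:
    void insert(const T& value)
    {
        data_.push_back(value);   // amortized O(1), may break the ordering
        sorted_ = false;
    }

    // The element count does not depend on the order, so no sort is needed.
    std::size_t size() const { return data_.size(); }

    bool contains(const T& value)
    {
        ensure_sorted();
        return std::binary_search(data_.begin(), data_.end(), value);
    }

    const std::vector<T>& sorted_values()
    {
        ensure_sorted();
        return data_;
    }

private:
    void ensure_sorted()
    {
        if (sorted_) return;      // nothing was appended since the last sort
        std::sort(data_.begin(), data_.end());
        sorted_ = true;
    }

    std::vector<T> data_;
    bool sorted_ = true;          // an empty vector is trivially sorted
};
```

The trade-off is that the first read after a batch of inserts pays for the whole sort, so this only helps when inserts and reads come in separate phases.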
5

I think there's no 'sorted container' adapter in the STL because the associative containers already keep things sorted and are appropriate to use in nearly all cases. To be honest, about the only reason I can think of off the top of my head for having a sorted vector<> container might be to interoperate with C functions that expect a sorted array. Of course, I may be missing something.

If you feel that a sorted vector<> would be more appropriate for your needs (being aware of the shortcomings of inserting elements into a vector), here's an implementation on Code Project:

I've never used it, so I can't vouch for it (or its license - if any is specified). But a quick read of the article and it looks like the author at least made a good effort for the container adapter to have an appropriate STL interface.

It seems to be worth a closer look.

Michael Burr
  • 1
    A sorted vector is likely to be faster until the set gets fairly big (100's of elements). Sets have *horrible* cache-locality. – Martin Bonner supports Monica Jan 10 '18 at 10:19
  • The thing such a class would have, unlike the associative containers, is the possibility of keeping duplicate elements. With key-based containers that would require something like a multimap, which is iterated in a drastically different way from a vector's linear iteration. – Andrew Apr 21 '23 at 09:08
4

If you decide to roll your own, you might also want to check out Boost.uBLAS. Specifically:

#include <boost/numeric/ublas/vector_sparse.hpp>

and look at coordinate_vector, which implements a vector of values and indexes. This data structure supports O(1) insertion (violating the sort), but then sorts on demand in Omega(n log n). Of course, once it's sorted, lookups are O(log n). If part of the array is sorted, the algorithm recognizes this and sorts only the newly added elements, then does an in-place merge. If you care about efficiency, this is probably the best you can do.
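
The "sort only the newly added elements, then merge" strategy can be illustrated with the standard library alone. This is not the uBLAS implementation, just a sketch of the same idea; the class name `merge_sorted_vector` and its members are hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical class: keeps a sorted prefix plus an unsorted tail of newly
// appended values, and merges the two only when a read requires order.
template <typename T>
class merge_sorted_vector
{
public:
    void append(const T& value) { data_.push_back(value); }  // amortized O(1)

    bool contains(const T& value)
    {
        consolidate();
        return std::binary_search(data_.begin(), data_.end(), value);
    }

    const std::vector<T>& values()
    {
        consolidate();
        return data_;
    }

private:
    void consolidate()
    {
        if (sorted_count_ == data_.size()) return;  // no unsorted tail
        auto middle = data_.begin() + static_cast<std::ptrdiff_t>(sorted_count_);
        std::sort(middle, data_.end());             // sort the new tail only
        // Merge the sorted prefix and the sorted tail in place.
        std::inplace_merge(data_.begin(), middle, data_.end());
        sorted_count_ = data_.size();
    }

    std::vector<T> data_;
    std::size_t sorted_count_ = 0;  // length of the already-sorted prefix
};
```

With k new elements appended to n sorted ones, this sorts only the k-element tail before merging, instead of re-sorting all n + k elements.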

Billy ONeal
Neil G
3

Alexandrescu's Loki has a sorted vector implementation, if you don't want to go through the relatively insignificant effort of rolling your own.

http://loki-lib.sourceforge.net/html/a00025.html

Mooing Duck
Lance Diduck
1

Here is my sorted_vector class that I've been using in production code for years. It has overloads to let you use a custom predicate. I've used it for containers of pointers, which can be a really nice solution in a lot of use cases.

moodboom