
Assume I have a std::set (which is by definition sorted), and I have another range of sorted elements (for the sake of simplicity, in a different std::set object). Also, I have a guarantee that all values in the second set are larger than all the values in the first set.

I know I can efficiently insert one element into a std::set: if I pass a correct hint, the insertion is amortized O(1). I know I can insert any range into a std::set, but since no hint is passed, this will be O(k log N) (where k is the number of new elements, and N the number of old elements).

Can I insert a range into a std::set and provide a hint? The only way I can think of is to do k single inserts with a hint, which does push the complexity of the insert operations in my case down to O(k):

#include <set>

std::set<int> bigSet{1, 2, 5, 7, 10, 15, 18};
std::set<int> biggerSet{50, 60, 70};

// Every new element is larger than anything already in bigSet,
// so the end iterator is always the right hint.
for (auto bigElem : biggerSet)
    bigSet.insert(bigSet.end(), bigElem);
penelope
  • Why would k hinted insertions of complexity O(1) result in an O(kN) complexity operation? Isn't it O(k), pun not intended? – papagaga Apr 12 '18 at 14:59
  • @papagaga Oh, well, that's because `k` times `1` equals `kN` for sufficiently large values of `1` :) Sorry, typo, I edited it out. – penelope Apr 12 '18 at 15:11
  • I wonder if `std::set` is what you're looking for. Using it to keep sorted collections isn't a good idea. You use `std::set` when you need to rely on the property that insertions don't invalidate iterators and references. To keep sorted collections, `std::vector` and `std::lower_bound` generally are the best choice. And what you ask for is a no-brainer on two `vector`s – papagaga Apr 12 '18 at 15:13
  • @papagaga I am actually working with `std::map`s, which represent, in a way, very sparse histograms. I calculate several of these histograms (about 10, not thousands). The calculation is the complicated bit, where I need to add and update many, many elements out of order, so I need O(log N) find/insert, and the `std::map` is the correct data structure for this. After each calculation, I need to update the "globalHistogram" by adding the information of the current histogram to it, which is what my question refers to. I asked the question with `std::set` for simplicity instead. – penelope Apr 12 '18 at 15:48
  • @papagaga Currently my construction is O(N log N). Working with a vector, I would need to at least `std::find` before every insertion to check for duplicates (80%+ of insertions are duplicates, so keeping everything and then filtering unique is horrible memory-wise), and sort it at the end: O(N^2 + N log N). I would also need to `std::find` every time I needed to read a value from the sparse histogram, which is again not ideal. – penelope Apr 12 '18 at 15:54
  • @papagaga So while I know about the whole "without giving context you sometimes ask one question, but then the context reveals you want to know a different thing", I am really, truly, only interested to know if there is an `insert` operation for `std::set` and `std::map` that inserts a range and still takes a hint iterator. I am working with `std::map`s, and I have a library (not mine) of things processing this particular histogram structure. I would just like to occasionally merge them together, in a particular way, then continue processing with all the functions I have available. – penelope Apr 12 '18 at 16:00
  • I certainly didn't mean to undermine your questioning. That said, `std::find` isn't what you would use with `std::vector`. You would keep your vectors sorted, and use `std::lower_bound` to check for a duplicate and find the position to insert your new value; that's also `log N`, but probably a "better" `log N` because of the contiguous layout in memory. With only 20% of actual insertions (insertions would be the somewhat disadvantageous operation), I wouldn't be surprised to see vectors performing quite well. – papagaga Apr 12 '18 at 16:19
  • This is an excellent question. My guess is that the answer is no. That said, while the O(k log N) complexity of the range insertion may seem worse than the O(k) complexity of your insert-one-at-a-time solution, it's possible that in practice it might run faster because there may be efficiencies in the memory allocation and possibly in the number of re-balancing operations. – Adrian McCarthy Apr 12 '18 at 16:33
  • @AdrianMcCarthy Thinking about it, I also have a feeling that the answer is "probably not". I have a specific case of all the new elements being inserted at the end and taking the same hint. But in the general case, a newly inserted range might not be contiguous in the old container, so the usefulness of the `hint` decreases with every element of the range (i.e. set1={10,20,30}, set2={15,25,35}: now the optimal `hint` is quite different for each of the three insertions). But it really would be interesting to compare whether my one-by-one O(k) really is faster than the all-at-once O(k log N) – penelope Apr 12 '18 at 16:41
  • To answer your specific question: no, there is no such insert operation listed in the documentation. Insert with hint is only faster if the inserted node can go right next to (before?) the hint in most implementations. Note also that inserting a sorted series of nodes into a red-black tree in order is not the most efficient way, as it could potentially cause numerous rebalancing operations. The insert iterator could be optimised to rebalance at the end, I guess, but I don't think it is. – Gem Taylor Apr 12 '18 at 17:58

3 Answers


First of all, to do the merge you're talking about, you probably want to use `std::set`'s (or `std::map`'s) `merge` member function, which lets you merge some existing set into this one. The advantage of doing this (and the reason you might not want to, depending on your usage pattern) is that the items being merged are actually moved from one set to the other, so you don't have to allocate new nodes (which can save a fair amount of time). The disadvantage is that the nodes then disappear from the source set, so if you need each local histogram to remain intact after being merged into the global histogram, you don't want to do this.
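
For illustration, a minimal sketch of using `merge` (available since C++17), with numbers standing in for the question's histogram keys:

#include <iostream>
#include <set>

int main() {
    std::set<int> global{1, 2, 5, 7, 10, 15, 18};
    std::set<int> local{5, 50, 60, 70};

    // Nodes are moved from local into global; nothing is reallocated.
    global.merge(local);

    // Keys that already existed in global stay behind in the source.
    std::cout << global.size() << '\n';  // 10
    std::cout << local.size() << '\n';   // 1 (the duplicate 5)
}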

You can typically do better than O(log N) when searching a sorted vector. Assuming a reasonably predictable distribution of values, you can use an interpolation search, which typically runs in around O(log log N), often called "pseudo-constant" complexity.
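
A sketch of what such an interpolation search might look like over a sorted `std::vector<int>` (the helper name is illustrative, not a standard facility):

#include <cstddef>
#include <vector>

// Returns the index of key in the sorted vector v, or -1 if absent.
long interpolation_search(const std::vector<int>& v, int key) {
    if (v.empty()) return -1;
    std::size_t lo = 0, hi = v.size() - 1;
    while (key >= v[lo] && key <= v[hi]) {
        if (v[lo] == v[hi])  // flat range: avoid dividing by zero
            return v[lo] == key ? static_cast<long>(lo) : -1;
        // Estimate the position from the value itself, assuming values
        // are spread roughly uniformly between v[lo] and v[hi].
        std::size_t mid = lo + static_cast<std::size_t>(
            (static_cast<double>(key - v[lo]) / (v[hi] - v[lo])) * (hi - lo));
        if (v[mid] == key) return static_cast<long>(mid);
        if (v[mid] < key)  lo = mid + 1;
        else               hi = mid - 1;  // mid > lo here, so no underflow
    }
    return -1;
}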

Given that you only do insertions relatively infrequently, you might also consider a hybrid structure. This starts with a small chunk of data that you don't keep sorted. When you reach an upper bound on its size, you sort it and insert it into a sorted vector. Then you go back to adding items to your unsorted area. When it reaches the limit, again sort it and merge it with the existing sorted data.

Assuming you limit the unsorted chunk to no larger than log(N), search complexity is still O(log N): one O(log N) binary search (or O(log log N) interpolation search) on the sorted chunk, plus a linear scan over the at most log(N) unsorted items, which is also O(log N). Once you've verified that an item doesn't exist yet, adding it has constant complexity (just tack it onto the end of the unsorted chunk). The big advantage is that this can still easily use a contiguous structure such as a vector, so it's much more cache-friendly than a typical tree structure.
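
A minimal sketch of such a hybrid structure, assuming `int` keys and a chunk limit proportional to log2 of the current size (the class and its names are made up for illustration):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

class HybridSet {
    std::vector<int> sorted_;   // bulk of the data, kept sorted
    std::vector<int> pending_;  // small unsorted chunk

    std::size_t limit() const {
        // Cap the unsorted chunk at roughly log2(N), with a small floor.
        return std::max<std::size_t>(
            8, static_cast<std::size_t>(std::log2(sorted_.size() + 2)));
    }

public:
    bool contains(int x) const {
        return std::binary_search(sorted_.begin(), sorted_.end(), x)
            || std::find(pending_.begin(), pending_.end(), x) != pending_.end();
    }

    void insert(int x) {
        if (contains(x)) return;   // set semantics: skip duplicates
        pending_.push_back(x);     // constant-time append
        if (pending_.size() > limit()) {
            // Fold the chunk into the sorted data: sort it, then merge.
            std::sort(pending_.begin(), pending_.end());
            std::vector<int> merged;
            merged.reserve(sorted_.size() + pending_.size());
            std::merge(sorted_.begin(), sorted_.end(),
                       pending_.begin(), pending_.end(),
                       std::back_inserter(merged));
            sorted_.swap(merged);
            pending_.clear();
        }
    }
};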

Since your global histogram is (apparently) only ever populated with data coming from the local histograms, it might be worth considering just keeping it in a vector: when you need to merge in the data from one of the local histograms, use std::merge to combine the existing global histogram and the local one into a new global histogram. This has O(N + M) complexity (N = size of global histogram, M = size of local histogram). Depending on the typical size of a local histogram, this could pretty easily work out as a win.
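
For example, assuming both histograms are kept as sorted vectors of keys (the function name is illustrative):

#include <algorithm>
#include <iterator>
#include <vector>

// Build a new global histogram in O(N + M).
std::vector<int> merge_into_global(const std::vector<int>& global_hist,
                                   const std::vector<int>& local_hist) {
    std::vector<int> merged;
    merged.reserve(global_hist.size() + local_hist.size());
    std::merge(global_hist.begin(), global_hist.end(),
               local_hist.begin(), local_hist.end(),
               std::back_inserter(merged));
    // Keep set-like semantics by collapsing duplicate keys afterwards.
    merged.erase(std::unique(merged.begin(), merged.end()), merged.end());
    return merged;
}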

Jerry Coffin

You can merge the sets more efficiently using the dedicated functions for that, such as `std::set::merge`.

In case you insist, `insert` returns an iterator pointing at the inserted element, which you can reuse as the hint for the next insertion.

iterator insert( const_iterator hint, const value_type& value );

Code:

#include <set>

std::set<int> bigSet{1, 2, 5, 7, 10, 15, 18};
std::set<int> biggerSet{50, 60, 70};

// Feed each returned iterator back in as the hint for the next insert.
auto hint = bigSet.cend();
for (const auto& bigElem : biggerSet)
    hint = bigSet.insert(hint, bigElem);

This assumes, of course, that the elements you insert will end up together, or at least close to each other, in the final set. Otherwise there is not much to gain, only the fact that, since the source is a set (and therefore ordered), about half of the tree will not be searched.

There is also a member function `template<class InputIt> void insert(InputIt first, InputIt last);`. That might or might not do something like this internally.

alfC

Merging two sorted containers is much quicker than sorting. Its complexity is O(N), so in theory what you say makes sense. It's the reason why merge-sort is one of the quickest sorting algorithms. If you follow the link, you will also find pseudo-code; what you are doing is just one pass of the main loop.
You will also find the algorithm implemented in the STL as std::merge. This takes any pair of sorted ranges as input; I would suggest using std::vector as the default container for the new elements. Sorting a vector is a very fast operation. You may even find it better to use a sorted vector instead of a set for the output. You can always use std::lower_bound to get O(log N) lookups from a sorted vector.
Vectors have many advantages compared with set/map, not least of which is that they are very easy to visualise in a debugger :-)

(The example code at the bottom of the std::merge reference page shows how to use it with vectors.)
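
As a small sketch of the sorted-vector approach with std::lower_bound (assuming `int` keys, and checking for duplicates before inserting, as discussed in the comments above):

#include <algorithm>
#include <vector>

// Insert x into the sorted vector v unless it is already present.
// Returns true if it was inserted. The name is illustrative.
bool insert_unique_sorted(std::vector<int>& v, int x) {
    auto pos = std::lower_bound(v.begin(), v.end(), x);  // O(log N) search
    if (pos != v.end() && *pos == x) return false;       // duplicate: skip
    v.insert(pos, x);                                    // O(N) element shift
    return true;
}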

Thagi