
I have this piece of code:

#include <iostream>
#include <unordered_set>
using namespace std;

int main() {
    unordered_multiset<int> t;
    for (int i = 0; i < 1000000; i++) {
        if (i % 10000 == 0)
            cout << i << endl;   // print progress every 10000 insertions

        t.insert(10);            // always insert the same value
    }
}

So it just puts a lot of equal elements into an unordered_multiset. But I found out that the more elements the container already holds, the slower this works, and I cannot see the reason why. In my opinion, after applying the hash function and finding the bucket of the equal elements (since all equal elements are grouped together), the STL should just put the new one at the end of that bucket.

So what's wrong here?

Upd: I found the description of the unordered_multiset::insert function:

Single element insertions: Average case: constant. Worst case: linear in container size.

So the question can now be rephrased as: "Why is the worst case linear?"
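
For reference, a rough way to time it (just a sketch using std::chrono; the element counts are kept small on purpose and the exact numbers will depend on compiler and machine):

#include <chrono>
#include <iostream>
#include <unordered_set>
using namespace std;

int main() {
    // Insert n copies of the same value and print the elapsed time.
    // If each insertion were really O(1) on average, doubling n
    // should roughly double the total time.
    for (int n = 10000; n <= 80000; n *= 2) {
        unordered_multiset<int> t;
        auto start = chrono::steady_clock::now();
        for (int i = 0; i < n; i++)
            t.insert(10);        // always the same value
        auto stop = chrono::steady_clock::now();
        cout << n << " insertions: "
             << chrono::duration_cast<chrono::microseconds>(stop - start).count()
             << " us" << endl;
    }
}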

Nikita
  • "the more elements the container holds, the slower this works": so, what operation are we talking about here, find or insert? – basav Oct 05 '15 at 06:59
  • It's constant amortized time (i.e., `O(1)`) to find the bucket. But the bucket afaik is a linked list. – 101010 Oct 05 '15 at 07:01
  • We are talking about insertion. How does it work internally? Why does the insertion work so slowly? – Nikita Oct 05 '15 at 07:01
  • @101010 So insertion should be fast, shouldn't it? – Nikita Oct 05 '15 at 07:02
  • Can you show/explain how you found that out? – Alex Lop. Oct 05 '15 at 07:02
  • "So insertion should be fast, shouldn't it?" Not necessarily. The modern 2-3 level cache memory architecture defies theory: a linked list is by nature a cache-unfriendly data structure. – 101010 Oct 05 '15 at 07:04
  • "Just run this code" doesn't really help, since this is not a complete program and there are no timings or other performance tests. Some compilers might even optimize the insertions away since they're not used anywhere. – Sami Kuhmonen Oct 05 '15 at 07:10
  • I am sorry, but I cannot consider such a measurement precise. Use timer measurements and show that the average insertion time of the same item is significantly different for 100, 1000, 10000, 100000, 1000000 insertions; then it would be an interesting point. – Alex Lop. Oct 05 '15 at 07:12
  • hard to say... These containers have a hash policy which determines when to rehash or resize containers accordingly: `_M_rehash_policy._M_need_rehash(_M_bucket_count, _M_element_count, 1); if (__do_rehash.first) { const key_type& __k = this->_M_extract(__v); __n = this->_M_bucket_index(__k, __code, __do_rehash.second); }` – basav Oct 05 '15 at 07:21
  • https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-api-4.6/a00906_source.html .. you will have to swim through this – basav Oct 05 '15 at 07:23
  • I would say the worst case is when it has to rehash itself after each insertion? – SingerOfTheFall Oct 05 '15 at 07:24
  • For GCC 4.7.2 insertion of 1 million elements on my machine takes 0.347s, 10 million takes 3.4s, 100 million 37.2s - looks pretty linear to me. What compiler/version are you using Nikita? – Tony Delroy Oct 05 '15 at 07:57

2 Answers


Everything goes in the same bucket. To put something at the end of the bucket, you have to find the end of the bucket, and the more things in the bucket, the longer that takes.
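
One way to see this directly is the standard bucket interface (a sketch; the exact bucket count is implementation-defined):

#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_multiset<int> t;
    for (int i = 0; i < 20000; ++i)
        t.insert(10);

    // All equal values hash to the same bucket, so one bucket holds
    // every element while the remaining buckets stay empty.
    std::cout << "bucket_count: " << t.bucket_count() << '\n'
              << "elements in the bucket for 10: "
              << t.bucket_size(t.bucket(10)) << '\n';
}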

David Schwartz
  • Why doesn't the bucket store a pointer to the last list node for faster insertion? – dyp Oct 05 '15 at 08:34
  • Why should it? The performance constraints don't require it, permitting insertion to take linear time. And it would have costs on operations that are quite common, such as insertion of values with unique keys. Every instance would pay the cost of such a pointer, and only the ones that insert significant numbers of equal keys would benefit. And those cases can optimize the operation themselves, for example, by using a map to a list of values (see the sketch after these comments). – David Schwartz Oct 05 '15 at 08:39
  • Yeah, my question was indirectly aimed at the performance constraints. That is, I couldn't find the rationale *why* insertion is allowed to take linear time. I agree that overhead is probably the reason; but using a map makes the list (or indirection) unnecessary. – dyp Oct 05 '15 at 08:48
  • @dyp Most likely because it's possible that all the values will hash the same but be unequal. In that case, what else can insertion do but compare the newly-inserted element to every previous element? – David Schwartz Oct 05 '15 at 08:51
  • I don't quite get your last point: Why should comparison be necessary for insertion into a *multiset*? Couldn't it just append the new element to the bucket, possibly triggering rehashing? – dyp Oct 05 '15 at 09:06
  • @dyp It has to know whether the entry has the same value as other entries in the bucket first. Otherwise, it can't meet the ordering constraint that equal-valued keys are stored together. – David Schwartz Oct 05 '15 at 09:07
  • That, I think, is a crucial part missing in your answer (and which I was not aware of, thanks). This additional guarantee requires some overhead, and linear time worst case makes the common use case of unique elements have little overhead. – dyp Oct 05 '15 at 09:14
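
A sketch of the workaround mentioned above, shaped as an unordered_map from the value to a vector of its duplicates (one possible reading of "a map to a list of values", not the only one):

#include <iostream>
#include <unordered_map>
#include <vector>

int main() {
    // One hash-table entry per distinct key; duplicates are appended to
    // the vector, which is amortized O(1), instead of being linked into
    // an ever-longer bucket chain.
    std::unordered_map<int, std::vector<int>> t;
    for (int i = 0; i < 1000000; ++i)
        t[10].push_back(10);

    std::cout << "distinct keys: " << t.size()
              << ", copies of 10: " << t[10].size() << '\n';
}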

The container tries to balance itself by reorganizing its storage so that the load factor (the average number of elements per bucket) stays below max_load_factor. It does this by adding more buckets, in the hope that the data will be more evenly distributed.

When you store the same value in all elements, they will end up in the same bucket anyway. Worst possible condition for a hash table!
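
A sketch that makes this visible, printing the bucket count after each rehash (the growth pattern is implementation-defined):

#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_multiset<int> t;
    std::size_t buckets = t.bucket_count();
    for (int i = 0; i < 20000; ++i) {
        t.insert(10);
        if (t.bucket_count() != buckets) {   // the table was rehashed
            buckets = t.bucket_count();
            std::cout << "size " << t.size() << ": " << buckets
                      << " buckets, but the bucket for 10 still holds "
                      << t.bucket_size(t.bucket(10)) << " elements\n";
        }
    }
}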

Bo Persson