
I don't understand why it is not linear.

There is a good answer to a similar question about multiset: why a hashtable's rehash complexity may be quadratic in the worst case

But what about set? There can be only one element for each key.

Update:

Many keys in one bucket are also not a problem. We can go through them in linear time.

I think the right answer, as mentioned below, is that the O(n^2) rehash complexity is included in the standard to allow open addressing (and perhaps other) implementations.

ton4eg
  • repeated linear additions - 1 + 2 + ... + n = n(n+1)/2 – user3125280 Jun 10 '14 at 09:59
  • Even if there is only one element for each key, there may be multiple keys with identical hash values, so the answer is the same. – Mankarse Jun 10 '14 at 09:59
  • @Mankarse: You are wrong, see my answer. Well, you are right that there are multiple keys with identical hash values. However, this does not imply `O(n²)` rehashing behaviour. – gexicide Jun 10 '14 at 10:26
  • The linked question has a pretty good answer. Which part is confusing you? Which part is unclear? Please add more details to your question. – Ali Jun 10 '14 at 10:32
  • @gexicide: It is certainly possible to write such a container, but the standard does not require that implementations do so. Worst case rehash performance is `O(n²)` (see `[unord.req]/10` Table 103). – Mankarse Jun 10 '14 at 10:33
  • @Ali: See my answer, the link is basically wrong. – gexicide Jun 10 '14 at 10:42
  • @Mankarse: Right, this is true and basically *the* correct answer. There is such an `O(n)` algorithm, but the standard doesn't require implementations to use it. The standard doesn't require `O(n)`, so every `O(n²)` implementation is fine. Full stop. Spec has spoken. But still, one could wish for it :). – gexicide Jun 10 '14 at 10:43
  • @gexicide My guess is that they didn't implement it the way you write in your answer because of the significant additional space overhead. After all, if you have such a bad hash function that maps all the elements into the same bucket, you really deserve the `O(n²)` complexity. Don't guard against this worst case, but go back and fix your hash function. In more practical cases, assuming a reasonable hash function, this won't happen, so there is no point in guarding against this worst case complexity. – Ali Jun 10 '14 at 10:58
  • @Ali: You are right for this question. However, the link talks about multisets and in that scenario, it can happen quite often that there are a lot of duplicates in the multiset. And the space overhead is not *that* big. It is one additional pointer per bucket. But true, it *is* some additional space overhead, so this might be the reason why this algorithm was not chosen for `stl`. – gexicide Jun 10 '14 at 11:01
  • @gexicide Unfortunately I don't know how `unordered_multiset` is implemented, how it handles the duplicate keys. As for the space overhead, it depends on the implementation, but I can easily come up with a case where you use 1.67x-2.0x more space than necessary if you also store a pointer to the last element. That is significant; if you are on a 32 bit machine with 4GB RAM, it can be a problem. All in all, in case of `unordered_set`, I see no point in guarding against this corner case; the hash function should be fixed in the first place. – Ali Jun 10 '14 at 11:11
  • @gexicide Having said that, you are right, the worst case `O(n²)` complexity can be avoided. – Ali Jun 10 '14 at 11:11
  • @Ali: Check the edit of my answer. I have conducted a memory analysis. The maximum possible overhead is **53%** for one-byte objects. For eight-byte objects it is **43%**. So it is not that high, but indeed may be relevant in memory-constrained environments. – gexicide Jun 10 '14 at 11:24
  • @gexicide OK, so say, at a cost of 50% memory overhead, we can guard against a situation which should (practically) never happen in the first place with a half-sane hash function. I suggest ending this discussion, we are not helping anyone. – Ali Jun 10 '14 at 11:36
  • @Ali: Check the edit at the bottom of my post. It is possible to rehash sets and multisets without needing additional space overhead in `O(n)` – gexicide Jun 10 '14 at 11:40

1 Answer


Basically, it would be possible to build a hash set that has O(n) worst-case rehash time. It would even be possible to build a multiset with this property that still gives the same guarantee that elements with the same key are stored behind each other in a bucket, so the link you cite is wrong. (Well, not fully wrong; it admits that there may be O(n) implementations.)

It works like this:

for each bucket b of old table
    for each element e in bucket b
        b' = new bucket of e
        prepend e before the first entry in b' // <---- this is the key optimization

The algorithm works for sets and multisets and is extremely simple. Instead of appending elements to the new bucket (which would be O(number of elements in the bucket)), we prepend to the bucket, which is O(1) (simply change two pointers).

Of course, this will reverse the elements in the bucket. But this is okay since the key multiset/multimap guarantee that equal elements are stored behind each other still holds.
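To make the idea concrete, here is a minimal, self-contained sketch of a separate-chaining hash set whose rehash relinks every node into its new bucket by prepending. The names (`ChainedSet`, `Node`, ...) are purely illustrative and this is not how any particular standard library implements it; the point is only that each node is visited exactly once and moved with two pointer writes, so the rehash is O(n) even if every key lands in the same bucket:

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Sketch only: a separate-chaining hash set whose rehash prepends each node
// into its new bucket. Every node is relinked with O(1) pointer work, so the
// whole rehash is O(n) regardless of how skewed the buckets are.
template <typename Key, typename Hash = std::hash<Key>>
class ChainedSet {
    struct Node {
        Key key;
        Node* next;
    };

    std::vector<Node*> buckets_;
    Hash hash_;

    std::size_t bucket_of(const Key& k, std::size_t n) const {
        return hash_(k) % n;
    }

public:
    explicit ChainedSet(std::size_t bucket_count = 8)
        : buckets_(bucket_count, nullptr) {}

    ~ChainedSet() {
        for (Node* head : buckets_)
            while (head) { Node* n = head; head = head->next; delete n; }
    }

    void insert(const Key& k) {
        Node*& head = buckets_[bucket_of(k, buckets_.size())];
        for (Node* n = head; n; n = n->next)
            if (n->key == k) return;          // set semantics: no duplicates
        head = new Node{k, head};             // prepend into the bucket
    }

    void rehash(std::size_t new_bucket_count) {
        std::vector<Node*> new_buckets(new_bucket_count, nullptr);
        for (Node* head : buckets_) {         // walk each old chain once
            while (head) {
                Node* node = head;
                head = head->next;            // detach from the old chain
                Node*& dst = new_buckets[bucket_of(node->key, new_bucket_count)];
                node->next = dst;             // prepend: two pointer writes, O(1)
                dst = node;
            }
        }
        buckets_ = std::move(new_buckets);
    }
};

int main() {
    ChainedSet<int> s(2);
    for (int i = 0; i < 100; ++i) s.insert(i);
    s.rehash(64);                             // linear in the number of stored elements
}
```

As noted above, keys that shared a bucket come out in reverse order after such a rehash, which is harmless for set and multiset semantics.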

But beware of open addressing

My solution only works for hashing with chaining. It does not work for open addressing. Thus, since the spec surely wants to allow both implementation methods, it must state that O(n²) might be the worst case, even if there are implementations that have better asymptotic runtimes.
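For contrast, here is a toy probe-counting model (again, not any particular library's implementation) of why rehashing into an open-addressing table with linear probing can be quadratic: if every key hashes to the same slot, the i-th insertion has to step over the i-1 elements already placed, so the probes sum to 1 + 2 + ... + n = n(n+1)/2.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Toy model: count the probes needed to re-insert n keys that all hash to
// slot 0 into an empty open-addressing table with linear probing.
int main() {
    const std::size_t n = 1000;               // elements, all with hash value 0
    std::vector<bool> occupied(2 * n, false);  // the (empty) new table
    std::size_t probes = 0;

    for (std::size_t i = 0; i < n; ++i) {
        std::size_t slot = 0;                  // every key starts probing at slot 0
        while (occupied[slot]) { ++slot; ++probes; }
        occupied[slot] = true;
        ++probes;                              // the final probe that found a free slot
    }
    std::cout << "elements: " << n << ", probes: " << probes << '\n';
    // prints probes = n(n+1)/2 = 500500, i.e. quadratic in n
}
```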

gexicide
  • I removed my answer since it still didn't show the need for quadratic. The point was that the key isn't required to be sortable, but it appears sorting isn't even required for this algorithm to work. I guess the quadratic worst case is there just to allow implementations with minimal memory overhead. – eerorika Jun 10 '14 at 11:40
  • @user2079303: Yes, I also realized this now. With this, we can even get `O(n)` without needing any extra space. See the last section of my edited answer. – gexicide Jun 10 '14 at 11:41
  • @gexicide Okay. But it would only work with chaining; it wouldn't work with open addressing. But I am begging you: *Why are we guarding against a corner case that should never happen in the first place?* – Ali Jun 10 '14 at 11:46
  • @Ali: Right, open addressing won't work. And again right, for a `set` it is a corner case. But it is not a corner case for a `multiset`. Okay, the OP asked about set, but he quoted a link for a multiset and it is even possible for this data structure. But you are probably right, the spec wants to ensure that possible implementations may also use open addressing. I will add the open addressing stuff to my answer. – gexicide Jun 10 '14 at 11:49
  • Given the bucket interface and the requirement for stable references on insert, I doubt the spec wants to allow open addressing. – PlasmaHH Jun 10 '14 at 12:52