25

I was solving the basic problem of finding the number of distinct integers in a given array.

My idea was to declare an std::unordered_set, insert all given integers into the set, then output the size of the set. Here's my code implementing this strategy:

#include <iostream>
#include <fstream>
#include <cmath>
#include <algorithm>
#include <vector>
#include <unordered_set>

using namespace std;

int main()
{
    int N;
    cin >> N;
    
    int input;
    unordered_set <int> S;
    for(int i = 0; i < N; ++i){
        cin >> input;
        S.insert(input);
    }
    
    cout << S.size() << endl;

    return 0;
}

This strategy worked for almost every input case, but on a few of them it timed out.

I was curious to see why my program was timing out, so I added a cout << i << endl; line inside the for-loop. What I found was that, when I fed in the failing input case, the first 53000 or so iterations of the loop passed nearly instantly, but after that only a few hundred iterations occurred each second.

I've read up on how a hash set can end up with O(N) insertion if a lot of collisions occur, so I thought the input was causing a lot of collisions within the std::unordered_set.

However, this is not possible. The hash function that the std::unordered_set uses for integers maps them to themselves (at least on my computer), so no collisions would ever happen between different integers. I accessed the hash function using the code at this link.
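For reference, here's a minimal, self-contained way to check that on your own platform (this is not the code from the link, just a direct call to std::hash<int>):

#include <functional>
#include <iostream>

int main()
{
    std::hash<int> h;
    // On libstdc++ (and several other implementations), std::hash<int> returns
    // the value itself for non-negative ints, so each line prints two equal numbers.
    for (int x : {0, 1, 42, 123456})
        std::cout << x << " -> " << h(x) << '\n';
    return 0;
}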

My question is, is it possible that the input itself is causing the std::unordered_set to slow down after it hits around 53000 elements inserted? If so, why?

Here is the input case that I tested my program on. It's pretty large, so it might lag a bit.

  • 2
    Where are you testing this program? I just ran it locally using Visual Studio 2019, and it took a couple of seconds in release mode, where the numbers are read from a file. – PaulMcKenzie Aug 21 '20 at 02:47
  • 1
    I tested this program on Geany 1.37. I didn't know that different IDEs could cause such different program results... In the future, I'll make sure that I test programs on different IDEs before posting questions about them. – Christopher Miller Aug 21 '20 at 02:51
  • 1
    I wonder if cache misses are a part of this (which would make it more apparent in a 64 bit build vs a 32 bit build because of the larger pointers). – 1201ProgramAlarm Aug 21 '20 at 02:53
  • 4
    _so no collisions would ever happen between different integers_ is missing something very important about the way hash tables work. The bucket key isn't the hash (which may indeed not collide), but the key modulo the size of the table. The table is not infinite in size, so collisions will _always_ happen when you get over a fairly small load factor. – Useless Aug 21 '20 at 17:07

2 Answers

23

The input file you've provided consists of successive integers congruent to 1 modulo 107897. So what is most likely happening is that, at some point when the load factor crosses a threshold, the particular library implementation you're using resizes the table, using a table with 107897 entries, so that a key with hash value h would be mapped to the bucket h % 107897. Since each integer's hash is itself, this means all the integers that are in the table so far are suddenly mapped to the same bucket. This resizing itself should only take linear time. However, each subsequent insertion after that point will traverse a linked list that contains all the existing values, in order to make sure it's not equal to any of the existing values. So each insertion will take linear time until the next time the table is resized.
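To see the effect in isolation, here is a small sketch (not your exact input file; it forces a large table up front with rehash, then inserts keys that are all congruent modulo the bucket count, assuming libstdc++'s identity hash for int):

#include <iostream>
#include <unordered_set>

int main()
{
    std::unordered_set<int> S;
    S.rehash(107897);  // libstdc++ picks a prime bucket count >= 107897
    const int B = static_cast<int>(S.bucket_count());
    std::cout << "bucket_count = " << B << '\n';

    // All keys are congruent to 1 modulo B. With the identity hash, every one
    // of them maps to bucket 1 % B, so they pile up in a single chain.
    for (int k = 0; k < 10000; ++k)
        S.insert(1 + k * B);

    std::cout << "elements in the bucket of key 1: "
              << S.bucket_size(S.bucket(1)) << '\n';
    return 0;
}

Each insert in that loop has to walk the chain built so far, which is exactly the quadratic slowdown you observed once your table was resized.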

In principle the unordered_set implementation could avoid this issue by also resizing the table when any one bucket becomes too long. However, this raises the question of whether this is a hash collision with a reasonable hash function (thus requiring a resize), or the user was just misguided and hashed every key to the same value (in which case the issue will persist regardless of the table size). So maybe that's why it wasn't done in this particular library implementation.

See also https://codeforces.com/blog/entry/62393 (an application of this phenomenon to get points on Codeforces contests).
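The usual workaround described in that post is to give the container a hash that scrambles the bits and is seeded per run, so a crafted input can no longer line up with the bucket count. A sketch based on the splitmix64 finalizer:

#include <chrono>
#include <cstdint>
#include <unordered_set>

struct SplitMix64Hash {
    std::size_t operator()(std::uint64_t x) const {
        // Run-dependent seed so an adversary can't precompute colliding keys.
        static const std::uint64_t seed = static_cast<std::uint64_t>(
            std::chrono::steady_clock::now().time_since_epoch().count());
        x += seed + 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return static_cast<std::size_t>(x ^ (x >> 31));
    }
};

// Usage: std::unordered_set<int, SplitMix64Hash> S;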

Brian Bi
  • 1
    I really like Abseil's philosophy here: they unapologetically use power-of-2 sizes, and shrug off the argument of "but it doesn't work with poor hash functions" with the argument that if the hash function is broken, then it's the hash function that needs fixing, not the hash table. – Matthieu M. Aug 21 '20 at 14:32
  • How come the table would resize to `107897`? It's not a power of 2, nor is it close to one. – Quelklef Aug 21 '20 at 17:23
  • 1
    @Quelklef libstdc++ uses prime numbers – Brian Bi Aug 21 '20 at 17:26
6

Your program works absolutely fine. There is nothing wrong with the hash algorithm, collisions, or anything of the sort.

The throttling you are seeing comes from the console I/O when you attempt to paste 200000 numbers into the window. That's why it chokes. Redirect the input from a file and it works fine, returning the result almost instantly.

C:\Users\selbie\source\repos\ConsoleApplication126\Debug>ConsoleApplication126.exe  < d:/test.txt
200000

All the numbers in your test input file are unique, so the output is 200000.
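If you'd rather not depend on shell redirection, the same check can be done by opening the file directly (the path is just the test file from the command above):

#include <fstream>
#include <iostream>
#include <unordered_set>

int main()
{
    std::ifstream in("d:/test.txt");  // same test file as in the redirect above
    int N;
    in >> N;

    std::unordered_set<int> S;
    for (int i = 0, x; i < N && (in >> x); ++i)
        S.insert(x);

    std::cout << S.size() << '\n';
    return 0;
}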

selbie