27

So originally, I had this code:

import java.util.*;

public class sandbox {
    public static void main(String[] args) {
        HashSet<Integer> hashSet = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            hashSet.add(i);
        }

        long start = System.currentTimeMillis();

        for (int i = 0; i < 100_000; i++) {
            for (Integer val : hashSet) {
                if (val != -1) break;
            }

            hashSet.remove(i);
        }

        System.out.println("time: " + (System.currentTimeMillis() - start));
    }
}

It takes around 4s to run the nested for loops on my computer and I don't understand why it took so long. The outer loop runs 100,000 times, the inner for loop should run 1 time (because any value of hashSet will never be -1) and the removing of an item from a HashSet is O(1), so there should be around 200,000 operations. If there are typically 100,000,000 operations in a second, how come my code takes 4s to run?

Additionally, if the line hashSet.remove(i); is commented out, the code only takes 16ms. If the inner for loop is commented out (but not hashSet.remove(i);), the code only takes 8ms.

scdivad
  • 421
  • 3
  • 10
  • 4
    I confirm your findings. I could speculate about the reason, but hopefully someone clever will post a fascinating explanation. – khelwood Dec 29 '19 at 18:39
  • 1
    It looks like the `for val` loop is the thing taking up the time. The `remove` is still very fast. Some kind of overhead setting up a new iterator after the set has been modified...? – khelwood Dec 29 '19 at 18:45
  • @apangin provided a good explanation in https://stackoverflow.com/a/59522575/108326 for why the `for val` loop is slow. However, note that the loop isn't needed at all. If you want to check whether there are any values different from -1 in the set, it would be much more efficient to check `hashSet.size() > 1 || !hashSet.contains(-1)`. – markusk Jan 02 '20 at 22:08

1 Answers1

32

You've created a marginal use case of HashSet, where the algorithm degrades to the quadratic complexity.

Here is the simplified loop that takes so long:

for (int i = 0; i < 100_000; i++) {
    hashSet.iterator().next();
    hashSet.remove(i);
}

async-profiler shows that almost all time is spent inside java.util.HashMap$HashIterator() constructor:

    HashIterator() {
        expectedModCount = modCount;
        Node<K,V>[] t = table;
        current = next = null;
        index = 0;
        if (t != null && size > 0) { // advance to first entry
--->        do {} while (index < t.length && (next = t[index++]) == null);
        }
    }

The highlighted line is a linear loop that searches for the first non-empty bucket in the hash table.

Since Integer has the trivial hashCode (i.e. hashCode is equal to the number itself), it turns out that consecutive integers mostly occupy the consecutive buckets in the hash table: number 0 goes to the first bucket, number 1 goes to the second bucket, etc.

Now you remove the consecutive numbers from 0 to 99999. In the simplest case (when the bucket contains a single key) the removal of a key is implemented as nulling out the corresponding element in the bucket array. Note that the table is not compacted or rehashed after removal.

So, the more keys you remove from the beginning of the bucket array, the longer HashIterator needs to find the first non-empty bucket.

Try to remove the keys from the other end:

hashSet.remove(100_000 - i);

The algorithm will become dramatically faster!

apangin
  • 92,924
  • 10
  • 193
  • 247
  • 1
    Ahh, I came across this but dismissed it after the first few runs and thought this might be some JIT optimization and moved to analysing via JITWatch. Should have run async-profiler first. Damn! – Adwait Kumar Dec 29 '19 at 19:29
  • 1
    Pretty interesting. If you do something like the following in the loop, it speeds it up by reducing the size of the inner map: `if (i % 800 == 0) { hashSet = new HashSet<>(hashSet); }`. – Gray Jan 02 '20 at 06:23