
I am trying to get a better understanding of how the internals of hashed sets, e.g. HashSet&lt;T&gt;, work and why they are performant. I discovered the following article, which implements a simple example with a bucket list: http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/.

As far as I understand this article (and I also thought that way before), the bucket list groups a certain number of elements in each bucket. A bucket is identified by the hash code, namely by the result of GetHashCode called on the element. I thought the better performance is based on the fact that there are fewer buckets than elements.

Now I have written the following naive test code:

    public class CustomHashCode
    {
        public int Id { get; set; }

        public override int GetHashCode()
        {
            //return Id.GetHashCode(); // Way better performance
            return Id % 40; // Bad performance! But why?
        }

        public override bool Equals(object obj)
        {
            return ((CustomHashCode) obj).Id == Id;
        }

    }

And here is the profiling code:

    public static void TestCustomHashCode(int iterations)
    {
        var hashSet = new HashSet<CustomHashCode>();
        for (int j = 0; j < iterations; j++)
        {
            hashSet.Add(new CustomHashCode { Id = j });
        }

        // Repeatedly look up one known element
        var chc = hashSet.First();
        var stopwatch = new Stopwatch();
        stopwatch.Start();
        for (int j = 0; j < iterations; j++)
        {
            hashSet.Contains(chc);
        }
        stopwatch.Stop();

        Console.WriteLine("Elapsed time (ms): {0}", stopwatch.ElapsedMilliseconds);
    }

My naive thought was: let's reduce the number of buckets (with a simple modulo); that should increase performance. But it is terrible: on my system it takes about 4 seconds with 50000 iterations. I also thought that if I simply return the Id as hash code, performance should be poor, since I would end up with 50000 buckets. But the opposite is the case; I guess I simply produced tons of so-called collisions instead of improving anything. But then again, how do the bucket lists work?

  • Why do you think a low number of buckets is good? It's preferable to have about one entry per bucket; that's why `HashSet` etc. take your hashcode modulo the capacity of the collection. If you have 50k items but only 50 buckets, each operation requires a sequential search through a linked list of 1000 items => slow – CodesInChaos Dec 12 '12 at 10:37
  • Ideally a hashcode should be a shortcut to equality, not some less specific "bucket" identifier. All items in the "bucket" should be equal. – Jodrell Dec 12 '12 at 10:43
  • Right - I think I wrongly assumed that the lookup in the bucket list itself is expensive too, which must be nonsense. – sl3dg3 Dec 12 '12 at 10:46
  • It has a price, but it's lower than a one-by-one enumeration. Essentially, that's the point of the `HashSet`. – Jodrell Dec 12 '12 at 10:48

3 Answers


A Contains check basically does the following:

  1. Gets the hashcode of the item.
  2. Finds the corresponding bucket - this is a direct array lookup based on the hashcode of the item.
  3. If the bucket exists, tries to find the item in the bucket - this iterates over all the items in the bucket.

By restricting the number of buckets, you've increased the number of items in each bucket, and thus the number of items that the hashset must iterate through, checking for equality, in order to see if an item exists or not. Thus it takes longer to see if a given item exists.
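
To put rough numbers on that (a small illustrative snippet; BucketDemo is a made-up name, and the counts mirror the question's 50000-item test):

    using System;
    using System.Linq;

    static class BucketDemo
    {
        static void Main()
        {
            // 50000 sequential ids, as in the question's test
            var ids = Enumerable.Range(0, 50000);

            // Id % 40 yields at most 40 distinct hash codes, so roughly
            // 50000 / 40 = 1250 items share each bucket, and a single
            // Contains call may have to walk a chain of that length.
            Console.WriteLine(ids.Select(id => id % 40).Distinct().Count()); // 40

            // Id.GetHashCode() yields 50000 distinct hash codes:
            // about one item per bucket, so each chain is tiny.
            Console.WriteLine(ids.Select(id => id.GetHashCode()).Distinct().Count()); // 50000
        }
    }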

You've probably decreased the memory footprint of the hashset; you may even have decreased the insertion time, although I doubt it. You haven't decreased the existence-check time.

– Rawling
  • I doubt that it improved the memory footprint. The buckets get allocated, even when they're empty. – CodesInChaos Dec 12 '12 at 10:42
  • So the only difference in performance is actually the fact that the lookup in the bucket itself is much faster? – sl3dg3 Dec 12 '12 at 10:45
  • No, the lookup in the bucket is _slower_. @Codes I'm not convinced the default constructor creates _any_ buckets, but I could well be wrong. – Rawling Dec 12 '12 at 10:46
  • But then again I don't get it: I end up with a bucket list of 50000 entries where I have to look up an entry, which is, according to what you say, slower? That was my initial thought: let's reduce the number of buckets... I still don't understand where the actual fast indexing magic happens. – sl3dg3 Dec 12 '12 at 10:50
  • @sl3dg3 The finding-of-the-right-bucket is quick; it's an O(1) operation. The finding-of-the-exact-element-in-the-bucket is slow; it's an O(number of items in bucket) operation. Thus you want to maximize the number of buckets and minimize the number of items in each. – Rawling Dec 12 '12 at 10:52
  • Step 2 (looking up the bucket) is fast; step 3 (enumerating for equality) is slower. – Jodrell Dec 12 '12 at 10:53

Reducing the number of buckets will not increase the performance. Actually, the GetHashCode method of Int32 returns the integer value itself, which is ideal for performance, as it produces as many distinct hash codes as possible.

The thing that gives a hash table its performance is the conversion from the key to the hash code, which means that it can quickly eliminate most of the items in the collection. The only items it has to consider are the ones in the same bucket. If you have few buckets, it can eliminate far fewer items.
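
For example (a quick check; this is the behavior of the Microsoft runtime, where Int32.GetHashCode simply returns the value):

    Console.WriteLine(42.GetHashCode());   // prints 42
    Console.WriteLine((-7).GetHashCode()); // prints -7: an Int32 is its own hash code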

The worst possible implementation of GetHashCode will cause all items to go in the same bucket:

    public override int GetHashCode() {
        return 0;
    }

This is still a valid implementation, but it means that the hash table gets the same performance as a regular list, i.e. it has to loop through all items in the collection to find a match.
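
As an illustration (AlwaysZeroHash is a made-up class, modeled on the question's test class):

    public class AlwaysZeroHash
    {
        public int Id { get; set; }

        public override int GetHashCode()
        {
            return 0; // every item lands in the same bucket
        }

        public override bool Equals(object obj)
        {
            return ((AlwaysZeroHash)obj).Id == Id;
        }
    }

    // Every Add must scan the whole chain for a duplicate before inserting,
    // so building a set of n items costs O(n^2), and each Contains costs O(n):
    // the same behavior as a linear search through a List<T>.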

– Guffa

A simple HashSet&lt;T&gt; could be implemented like this (a simplified sketch; Add and resizing are omitted):

    class HashSet<T>
    {
        // Fixed capacity keeps the sketch short; the real class grows and rehashes.
        const int Capacity = 16;

        struct Element
        {
            public int Hash; // cached hash code of Item
            public int Next; // index of the next element in the same bucket, or -1
            public T Item;
        }

        // buckets[i] holds the index of the first element of bucket i, or -1 if empty
        int[] buckets;
        Element[] data = new Element[Capacity];

        public HashSet()
        {
            buckets = new int[Capacity];
            for (int i = 0; i < Capacity; i++) buckets[i] = -1;
        }

        public bool Contains(T item)
        {
            int hash = item.GetHashCode();
            // Bucket lookup is a simple array lookup => cheap
            int index = buckets[(uint)hash % (uint)Capacity];
            // Searching for the actual item is linear in the number of items in the bucket
            while (index >= 0)
            {
                if (data[index].Hash == hash && Equals(data[index].Item, item))
                    return true;
                index = data[index].Next;
            }
            return false;
        }
    }

If you look at this, the cost of searching in Contains is proportional to the number of items in the bucket. So having more buckets makes the search cheaper, but once the number of buckets exceeds the number of items, the gain of additional buckets quickly diminishes.

Having diverse hash codes also serves as an early out when comparing objects within a bucket: the stored Hash field is checked first, avoiding potentially costly Equals calls.

In short, GetHashCode should be as diverse as possible. It's the job of HashSet&lt;T&gt; to reduce that large space to an appropriate number of buckets, which is approximately the number of items in the collection (typically within a factor of two).
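
A sketch of that reduction (GetBucket and bucketCount are illustrative names, not HashSet&lt;T&gt;'s actual API; the real class picks and grows the bucket count internally):

    static int GetBucket(int hashCode, int bucketCount)
    {
        // The cast to uint avoids a negative bucket index for negative hash codes.
        return (int)((uint)hashCode % (uint)bucketCount);
    }

    // GetBucket(id.GetHashCode(), 16) maps any int hash code into [0, 16);
    // as the set grows, the bucket count grows too and items are redistributed.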

– CodesInChaos
  • Thx for your example - I simply missed the point that the lookup in the bucket list itself is cheap, which is the whole point... – sl3dg3 Dec 12 '12 at 11:17