I am trying to get a better understanding of how the internals of hash-based sets, e.g. HashSet<T>,
work and why they are performant. I discovered the following article, which implements a simple example with a bucket list: http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/.
As far as I understand the article (and I also thought that way before), the bucket list groups a certain number of elements in each bucket. A bucket is identified by the hash code, namely by GetHashCode
called on the element. I thought the better performance comes from the fact that there are fewer buckets than elements.
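To make my mental model concrete, here is a rough sketch of how I imagine such a bucket list works (my own simplified illustration, not the actual HashSet<T> implementation; needs System and System.Collections.Generic):

// Simplified sketch of my mental model of a bucket list,
// NOT the real HashSet<T> internals.
public class NaiveBucketSet<T>
{
    private readonly List<T>[] _buckets;

    public NaiveBucketSet(int bucketCount)
    {
        _buckets = new List<T>[bucketCount];
        for (int i = 0; i < bucketCount; i++)
            _buckets[i] = new List<T>();
    }

    public void Add(T item)
    {
        // The hash code (reduced modulo the bucket count) picks the bucket...
        int index = (item.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;
        _buckets[index].Add(item);
    }

    public bool Contains(T item)
    {
        // ...and only that one bucket has to be searched linearly via Equals.
        int index = (item.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;
        return _buckets[index].Contains(item);
    }
}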
Now I have written the following naive test code:
public class CustomHashCode
{
    public int Id { get; set; }

    public override int GetHashCode()
    {
        //return Id.GetHashCode(); // Way better performance
        return Id % 40; // Bad performance! But why?
    }

    public override bool Equals(object obj)
    {
        return ((CustomHashCode) obj).Id == Id;
    }
}
And here is the profiling code:
public static void TestCustomHashCode(int iterations)
{
    var hashSet = new HashSet<CustomHashCode>();
    for (int j = 0; j < iterations; j++)
    {
        hashSet.Add(new CustomHashCode() { Id = j });
    }

    var chc = hashSet.First();
    var stopwatch = new Stopwatch();
    stopwatch.Start();
    for (int j = 0; j < iterations; j++)
    {
        hashSet.Contains(chc);
    }
    stopwatch.Stop();

    Console.WriteLine(string.Format("Elapsed time (ms): {0}", stopwatch.ElapsedMilliseconds));
}
My naive thought was: let's reduce the number of buckets (with a simple modulo), that should increase performance. But it is terrible: on my system it takes about 4 seconds with 50000 iterations. I also thought that if I simply return the Id as the hash code, performance should be poor, since I would end up with 50000 buckets. But the opposite is the case; I guess I simply produced tons of so-called collisions instead of improving anything. But then again, how do the bucket lists actually work?
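For reference, this is a small helper I used to compare how the two GetHashCode variants distribute (a made-up diagnostic of mine, not part of the test above; needs System and System.Linq):

// Hypothetical helper: counts how many distinct hash codes each variant produces.
public static void CountDistinctHashCodes(int iterations)
{
    var ids = Enumerable.Range(0, iterations).ToList();

    int moduloCodes   = ids.Select(id => id % 40).Distinct().Count();
    int identityCodes = ids.Select(id => id.GetHashCode()).Distinct().Count();

    // With Id % 40 all elements share at most 40 hash codes, so each bucket
    // chain holds roughly iterations / 40 entries that Equals has to walk.
    Console.WriteLine("Id % 40:          {0} distinct hash codes", moduloCodes);
    Console.WriteLine("Id.GetHashCode(): {0} distinct hash codes", identityCodes);
}

With 50000 elements it reports 40 distinct hash codes for the modulo variant versus 50000 for the plain Id, which is why I suspect collisions are the problem, even though I do not yet understand how the buckets handle them.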