
The other day I was reading that article on CodeProject

And I had a hard time understanding a few points about the implementation of the .NET Dictionary (considering the implementation discussed there, without all the optimizations in .NET Core):

  • Note: If you add more items than the maximum number in the table (i.e. 7199369), the resize method will manually search for the next prime number that is larger than twice the old size.

  • Note: The reason that the sizes are being doubled while resizing the array is to make the inner-hash table operations to have asymptotic complexity. The prime numbers are being used to support double-hashing.

So I tried to refresh my memory of my CS classes from a decade ago with my good friend Wikipedia:

But I still don't really see how, first, this relates to double hashing (which is a collision resolution technique for open-addressed hash tables), except for the fact that the Resize() method doubles the number of entries based on the minimum prime number (derived from the current/old size). And to be honest, I don't really see the benefit of "doubling" the size, nor what "asymptotic complexity" is supposed to mean here (I guess the article meant the O(n) cost paid when the underlying entries array is full and has to be resized).

First, if you double the size, with or without rounding to a prime number, isn't it really the same?

Second, it seems to me that the .NET hash table uses a separate chaining technique for collision resolution.

I guess I must have missed a few things, and I would like someone to shed some light on those two points.

Natalie Perret
  • Doubling allocation size *avoids* O(n) reallocation cost, it reduces to O(log2(n)). All .NET container classes except LinkedList use it. Why rounding up to a prime is useful is well explained by [this example](https://www.newyorker.com/tech/annals-of-technology/the-cicadas-love-affair-with-prime-numbers) from nature. – Hans Passant Jan 06 '19 at 13:49
  • "Doubling allocation size avoids O(n) reallocation cost, it reduces to O(log2(n))" not sure to get this part right, when I checked the code (of `List`) there is still an `Array.Copy` call (extern) that actually copies everything over. – Natalie Perret Jan 06 '19 at 19:25
  • I don't think .NET uses double hashing, as the job of calculating the hash code is delegated to the objects themselves (at least in the default case), and you can't ask those objects to use different prime factors or a different hashing algorithm. – Lasse V. Karlsen Jan 08 '19 at 08:21
  • @Lasse Vågsæther Karlsen that's also what I thought, then got unsure since I read some people saying the opposite but usually without much justification. – Natalie Perret Jan 08 '19 at 08:22
  • The current implementation (at least going by the [reference source](https://referencesource.microsoft.com/#mscorlib/system/collections/generic/dictionary.cs,6d8e35702d74cf71)) is that the dictionary maintains a "free buckets list", and reuses those by chaining them together with existing buckets when collisions occur. If no free slots are available, but the dictionary is not full, the next available entry is then used. If no free slots are available, and the dictionary is considered full, it is resized. – Lasse V. Karlsen Jan 08 '19 at 12:33
  • Additionally, strings get special treatment. When too many collisions occur on strings, a new randomized string comparer is then used (I don't know what the word "randomized" here means *exactly*, but that's part of the internal class name). I assume this is to handle the case where you add lots of strings that for some reason end up with lots of collisions; then a new comparer is created which will (hopefully) distribute the strings with fewer collisions in a new dictionary. The entries are then redistributed with this new comparer. – Lasse V. Karlsen Jan 08 '19 at 12:34
  • However, all of this is implementation details. The only thing you can rely on is that it behaves as documented, and none of this is documented. – Lasse V. Karlsen Jan 08 '19 at 12:35
  • @LasseVågsætherKarlsen but it happens that some interviewers ask questions about implementation details (which is not that relevant imho). I am thinking about adding a super-detailed answer to another SO question about how the .NET Dictionary works and leaving a note about the irrelevance of such interview questions... plus the implementations differ between the .NET Framework and .NET Core... – Natalie Perret Jan 08 '19 at 17:58
  • For some interview questions the only correct response is "Do you mean African or European?" – Lasse V. Karlsen Jan 08 '19 at 21:26
  • @LasseVågsætherKarlsen I am not really sure I get that one: https://www.youtube.com/watch?v=y2R3FvS4xr4, am I supposed to expect my interviewer to be ejected like this old ugly witch? ;-) – Natalie Perret Jan 09 '19 at 09:42
  • My point was that sometimes you have to "call their bluff" and ask for clarification why they feel this is something you should know; most programmers in 2018/2019 aren't required to implement dictionaries or hashtables from scratch. Yes, sure, it's good to know *of* these things, but the day-to-day value of it is rather low. – Lasse V. Karlsen Jan 09 '19 at 11:59
  • @LasseVågsætherKarlsen agreed, like most GAFAm interviews :) – Natalie Perret Jan 09 '19 at 12:02
  • **However**, if you're the king, you're expected to know these things, you know. – Lasse V. Karlsen Jan 09 '19 at 14:31
  • @Lasse Vågsæther Karlsen I never pretended to be "the" king, tho... otherwise wouldn't be asking that sort of questions on SO. – Natalie Perret Jan 09 '19 at 14:48
  • That was a quote from the video you linked ;) – Lasse V. Karlsen Jan 09 '19 at 17:51

1 Answer


I got my answer on Reddit, so I am going to try to summarize it here:

Collision Resolution Technique

First off, it seems that collision resolution uses the separate chaining technique and not open addressing, and therefore there is no double hashing strategy:

The code goes as follows:

private struct Entry 
{
    public int hashCode;    // Lower 31 bits of hash code, -1 if unused
    public int next;        // Index of next entry, -1 if last
    public TKey key;        // Key of entry
    public TValue value;    // Value of entry
}

It's just that instead of having dedicated storage (a list or whatever) for all the entries sharing the same hash code / bucket index, everything is stored in the same entries array: each entry points to the next entry of its bucket through the next field.
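
To make that concrete, here is a minimal, self-contained sketch of the idea (my own simplified TinyDictionary for illustration, not the actual BCL code): a buckets array holds the index of the first entry of each bucket (-1 when empty), and the chain continues through the next field inside the shared entries array.

using System;
using System.Collections.Generic;

// Sketch of "separate chaining inside a single array":
// buckets[b] holds the index of the first entry for bucket b,
// and each entry points to the next entry of the same bucket via 'next'.
class TinyDictionary<TKey, TValue>
{
    private struct Entry
    {
        public int hashCode;   // lower 31 bits of the key's hash code
        public int next;       // index of next entry in the chain, -1 if last
        public TKey key;
        public TValue value;
    }

    private readonly int[] buckets;    // -1 means "empty bucket"
    private readonly Entry[] entries;  // all entries, regardless of bucket
    private int count;
    private readonly IEqualityComparer<TKey> comparer = EqualityComparer<TKey>.Default;

    public TinyDictionary(int capacity = 7)
    {
        buckets = new int[capacity];
        for (int i = 0; i < buckets.Length; i++) buckets[i] = -1;
        entries = new Entry[capacity];
    }

    public void Add(TKey key, TValue value)
    {
        int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
        int bucket = hashCode % buckets.Length;

        entries[count].hashCode = hashCode;
        entries[count].next = buckets[bucket];  // chain in front of the current head
        entries[count].key = key;
        entries[count].value = value;
        buckets[bucket] = count;                // new entry becomes the head of its chain
        count++;                                // (no resizing, free list or duplicate check in this sketch)
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
        // Walk the chain of entries that landed in the same bucket.
        for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = entries[i].next)
        {
            if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key))
            {
                value = entries[i].value;
                return true;
            }
        }
        value = default(TValue);
        return false;
    }
}

Resizing, the free list for removed entries and the string-collision countermeasures are left out; the point is only that the "chains" live inside one flat array instead of one list object per bucket.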

Prime Number

About the prime number, the answer lies here: https://cs.stackexchange.com/a/64191/42745. It's all about common factors and multiples:

Therefore, to minimize collisions, it is important to reduce the number of common factors between m and the elements of K. How can this be achieved? By choosing m to be a number that has very few factors: a prime number.
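
As a toy illustration of that point (mine, not from the linked answer): if the keys all share a factor with the table size, they pile into a fraction of the buckets, whereas a prime table size only degrades that way when the keys are multiples of the prime itself.

using System;
using System.Linq;

class PrimeModuloDemo
{
    static void Main()
    {
        // Keys that happen to be multiples of 4 (e.g. aligned addresses, padded sizes...).
        int[] keys = Enumerable.Range(0, 100).Select(i => i * 4).ToArray();

        Console.WriteLine("distinct buckets mod 12: " +
            keys.Select(k => k % 12).Distinct().Count());   // only 3 buckets used (0, 4, 8)

        Console.WriteLine("distinct buckets mod 11: " +
            keys.Select(k => k % 11).Distinct().Count());   // all 11 buckets used
    }
}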

Doubling the underlying entries array size

Doubling helps avoid calling too many resize operations (i.e. copies) by growing the array by a large enough number of slots each time, so that insertion stays amortized constant time.

See this answer: https://stackoverflow.com/a/2369504/4636721

Hash-tables could not claim "amortized constant time insertion" if, for instance, the resizing was by a constant increment. In that case the cost of resizing (which grows with the size of the hash-table) would make the cost of one insertion linear in the total number of elements to insert. Because resizing becomes more and more expensive with the size of the table, it has to happen "less and less often" to keep the amortized cost of insertion constant.
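
A quick back-of-the-envelope sketch (again mine, not from the linked answer) makes the difference visible: count the element copies that resizing costs when inserting n items with geometric (doubling) growth versus fixed-increment growth.

using System;

class GrowthCostDemo
{
    static void Main()
    {
        const int n = 1_000_000;

        // Doubling: capacities 4, 8, 16, ... and each resize copies the current contents.
        long doublingCopies = 0;
        for (int capacity = 4; capacity < n; capacity *= 2)
            doublingCopies += capacity;   // elements copied when growing past 'capacity'

        // Fixed increment of 100 slots: far more resizes, each copying more and more.
        long incrementCopies = 0;
        for (int capacity = 100; capacity < n; capacity += 100)
            incrementCopies += capacity;

        Console.WriteLine($"doubling:  ~{doublingCopies:N0} copies in total");
        Console.WriteLine($"increment: ~{incrementCopies:N0} copies in total");
    }
}

With doubling, the total copy work stays proportional to n (so each insertion is amortized O(1)); with a fixed increment it grows roughly quadratically, which is exactly the point of the quoted answer.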

Natalie Perret