-1

I have a Dictionary with a custom hash function. I want to test the hash function because, even though it returns different hash codes for my test values, some of them may still map to the same bucket due to the modulo (%) operation. How can I check whether there are collisions in a C# Dictionary with a custom hash function, so that I can improve that function?

This is a development test to fine-tune the hash function and won't go into production, so changes to the internal implementation in other versions are not a concern.

In C++ it's possible to get the map's bucket size to check for collisions, but I couldn't find a way to do that in C#. How can I tell whether keys in a Dictionary have collided?

phuclv
  • I think it's an implementation detail. Why do you need to know this? – Sweeper Jul 14 '20 at 10:20
  • Are you asking how to get the hash code (`GetHashCode()`)? – mjwills Jul 14 '20 at 10:21
  • Do you care if it is built into the library? The hash is the first check of the key to do a lookup in log2(N) time. After the hash is used, a second check is done if there are duplicate hash values, by comparing the dictionary key against the hash key to get a unique key value. – jdweng Jul 14 '20 at 10:39
  • @Sweeper just curious, and it's also useful to quickly check some custom hashing functions – phuclv Jul 14 '20 at 16:23
  • @mjwills of course I know how to use `GetHashCode()` because I'm implementing it myself for my class. However, different hashes don't mean that a collision has not occurred, because internally the dictionary may use some modulo operation that maps different hashes to the same bucket – phuclv Jul 16 '20 at 03:39
  • Fair enough. When people talk about collisions they are often talking about equal hash codes - so just wanted to confirm what you were actually looking for. – mjwills Jul 16 '20 at 03:40

2 Answers

5

You can get internal buckets in the following way:

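using System.Collections.Generic;
using System.Reflection;

var dictionary = new Dictionary<string, int>();
dictionary.Add("a", 8);
dictionary.Add("b", 1);
// Read the private bucket array via reflection; the field is "_buckets" on .NET Core and "buckets" on .NET Framework 4.x.
var buckets = dictionary.GetType().GetField("_buckets", BindingFlags.NonPublic | BindingFlags.Instance)
              .GetValue(dictionary);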
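Once the bucket count is known (for example, the length of the array obtained above), bucket-level collisions can be estimated without reading any other internals. The following is a minimal sketch, not the Dictionary's actual algorithm: the `BucketCollisions` class and both of its methods are hypothetical names, and the reduction `(uint)hashCode % bucketCount` is an assumption that may not match every runtime version.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BucketCollisions
{
    // Assumed reduction: treat the hash code as unsigned and take it modulo the bucket count.
    public static int BucketIndex(int hashCode, int bucketCount)
    {
        return unchecked((int)((uint)hashCode % (uint)bucketCount));
    }

    // Returns only the buckets that would receive more than one key,
    // mapped to the keys that land in them.
    public static Dictionary<int, List<TKey>> FindSharedBuckets<TKey>(
        IEnumerable<TKey> keys, IEqualityComparer<TKey> comparer, int bucketCount)
    {
        return keys
            .GroupBy(k => BucketIndex(comparer.GetHashCode(k), bucketCount))
            .Where(g => g.Count() > 1)
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

For instance, `BucketCollisions.FindSharedBuckets(new[] { 1, 8, 15, 3 }, EqualityComparer<int>.Default, 7)` reports a single shared bucket, index 1, holding 1, 8 and 15, because an `int` hashes to its own value and all three leave remainder 1 when divided by 7.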
Cihan Yakar
  • `dictionary.GetType().GetField("_buckets", BindingFlags.NonPublic | BindingFlags.Instance)` returns `null` – Phate01 Jul 14 '20 at 10:30
  • I use .NET Core. For .NET 4.x you must change `_buckets` to `buckets` – Cihan Yakar Jul 14 '20 at 10:38
  • @CihanYakar while this technically answers the OP's question, it's not good practice. You're intentionally breaking encapsulation. If the internal implementation details change for any reason, your program stops working. – just.another.programmer Jul 14 '20 at 10:58
  • Yes, @just.another.programmer, any change can break it; I agree with you. For a quick check it seems OK, but of course it is not reliable code. The internal implementation can change in the next version. – Cihan Yakar Jul 14 '20 at 11:04
  • @CihanYakar you yourself already pointed out that the implementation changed between .NET 4.x and .NET Core in a way that would have broken this code. There's a reason we use access modifiers to enforce encapsulation! Have a look at my answer for a way to solve the problem without breaking encapsulation. – just.another.programmer Jul 14 '20 at 11:18
  • Yes indeed! I just wanted to show how to access internal items. – Cihan Yakar Jul 14 '20 at 11:54
4

You're probably better off creating a custom Dictionary implementation that changes the Add and Remove methods to check for hash collisions based on the computed GetHashCode of the elements. You can compose with a "real" Dictionary internally to do the real work of storing the elements.

Here's a sample version. You could optimize the Add and Remove methods depending on the type of hashes you're expecting.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public class CollisionDetectingDictionary<TKey, TValue> : IDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> InternalDictionary = new Dictionary<TKey, TValue>();
    private readonly List<int> HashCodesInDictionary = new List<int>();

    public event Action<int, TKey, IEnumerable<TKey>> HashCollision; 

    public TValue this[TKey key] { get => InternalDictionary[key]; set => InternalDictionary[key] = value; }
    public ICollection<TKey> Keys => InternalDictionary.Keys;
    public ICollection<TValue> Values => InternalDictionary.Values;
    public int Count => InternalDictionary.Count;
    public bool IsReadOnly => false;

    public void Add(TKey key, TValue value)
    {
        Add(new KeyValuePair<TKey, TValue>(key, value));
    }

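    public void Add(KeyValuePair<TKey, TValue> item)
    {
        var hashCode = item.Key.GetHashCode();
        if (HashCodesInDictionary.Contains(hashCode))
        {
            // Another stored key already has this hash code: report it.
            var collisions = GetKeysByHashCode(hashCode);
            HashCollision?.Invoke(hashCode, item.Key, collisions);
        }

        // Delegate storage to the wrapped dictionary (throws on a duplicate key),
        // then record the hash code once.
        InternalDictionary.Add(item.Key, item.Value);
        if (!HashCodesInDictionary.Contains(hashCode))
        {
            HashCodesInDictionary.Add(hashCode);
        }
    }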

    private IEnumerable<TKey> GetKeysByHashCode(int hashCode)
    {
        foreach (var key in Keys)
        {
            if(key.GetHashCode() == hashCode)
            {
                yield return key;
            }
        }
    }

    public void Clear()
    {
        InternalDictionary.Clear();
        // Keep the tracked hash codes in sync with the stored keys.
        HashCodesInDictionary.Clear();
    }

    public bool Contains(KeyValuePair<TKey, TValue> item)
    {
        return InternalDictionary.Contains(item);
    }

    public bool ContainsKey(TKey key)
    {
        return InternalDictionary.ContainsKey(key);
    }

    public void CopyTo(KeyValuePair<TKey, TValue>[] array, int arrayIndex)
    {
        ((IDictionary<TKey,TValue>)InternalDictionary).CopyTo(array, arrayIndex);
    }

    public IEnumerator<KeyValuePair<TKey, TValue>> GetEnumerator()
    {
        return InternalDictionary.GetEnumerator();
    }

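    public bool Remove(TKey key)
    {
        // Only touch the hash code list when the key was actually stored.
        if (!InternalDictionary.Remove(key))
        {
            return false;
        }

        var hashCode = key.GetHashCode();
        if (!GetKeysByHashCode(hashCode).Any())
        {
            // No remaining key shares this hash code.
            HashCodesInDictionary.Remove(hashCode);
        }

        return true;
    }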

    public bool Remove(KeyValuePair<TKey, TValue> item)
    {
        return Remove(item.Key);
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        return InternalDictionary.TryGetValue(key, out value);
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return InternalDictionary.GetEnumerator();
    }
}
just.another.programmer
  • You can also write this approach as: `var hasCollision = dictionary.Keys.GroupBy(k => k.GetHashCode()).Any(g => g.Count() > 1);` – Cihan Yakar Jul 14 '20 at 11:38
  • @CihanYakar yes, but that's a trade-off in efficiency. That requires calculating the hash code of the entire dictionary contents every time you add an item. For a large dictionary, you'll take a performance hit by doing that. In my implementation there's a performance hit on remove. Like I said, how you code the implementation depends on the expected use of the dictionary. – just.another.programmer Jul 14 '20 at 11:42
  • Of course, it depends on the purpose. I am asking just for clarification and fun :). If the check only runs after the dictionary has been filled, the LINQ solution is better (for the sake of simplicity), but if the user wants an immediate response, your solution is good. Neither solution works with a custom comparer. – Cihan Yakar Jul 14 '20 at 11:52
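
The custom-comparer limitation mentioned in the last comment comes from calling `GetHashCode()` on the key itself. A minimal sketch of the LINQ check that routes through the dictionary's own `Comparer` property instead is shown here; `HashCodeCollisions`, `Find` and `Mod10Comparer` are hypothetical names introduced only for illustration, and the comparer is deliberately weak so that collisions show up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer with a deliberately weak hash function.
public sealed class Mod10Comparer : IEqualityComparer<int>
{
    public bool Equals(int x, int y) => x == y;
    public int GetHashCode(int x) => x % 10;
}

public static class HashCodeCollisions
{
    // Groups the keys by the hash code produced by the dictionary's comparer
    // and returns only the groups that contain more than one key.
    public static List<List<TKey>> Find<TKey, TValue>(Dictionary<TKey, TValue> dictionary)
    {
        var comparer = dictionary.Comparer;
        return dictionary.Keys
            .GroupBy(k => comparer.GetHashCode(k))
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList())
            .ToList();
    }
}
```

With a `Dictionary<int, string>` built on `new Mod10Comparer()` and the keys 1, 11 and 2, `HashCodeCollisions.Find` returns one group containing 1 and 11, since both hash to 1 under that comparer. This only detects equal hash codes; it says nothing about distinct hash codes that end up in the same bucket.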