5

I'm looking for built-in alternatives of HashSet and Dictionary objects that have better performance than lists but do not use the internal GetHashCode method. I need this because for the class I have written, there is no way of writing a GetHashCode method that fulfills the usual contract with Equals other than

public override int GetHashCode() { return 0; } // or return any other constant value

which would turn HashSet and Dictionary into ordinary lists (performance-wise).

So what I need is a set implementation and a mapping implementation. Any suggestions?

EDIT:

My class is a tolerance-based 3-dimensional vector class:

public class Vector
{
    private const double TOL = 1E-10; // a const is implicitly static in C#
    private double x, y, z;

    public Vector(double x, double y, double z)
    {
        this.x = x; this.y = y; this.z = z;
    }

    public override bool Equals(object o)
    {
        Vector other = o as Vector;

        if (other == null)
            return false;

        return ((Math.Abs(x - other.x) <= TOL) &&
                (Math.Abs(y - other.y) <= TOL) &&
                (Math.Abs(z - other.z) <= TOL));
    }
}

Note that my Equals method is not transitive. However, in my use case I can make it "locally" transitive because at some point, I will know all vectors that I need to put into my set / mapping key set, and I also know that they will come in clusters. So when I have collected all vectors, I will choose one representative per cluster and replace all original vectors by the representative. Then Equals will be transitive among the elements of my set / mapping key set.
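
To make the non-transitivity concrete, here is a minimal sketch. It uses a stripped-down stand-in class (`TolerantVector` and `NearlyEquals` are names made up for this illustration, not part of my real code) with the same tolerance check as `Equals` above:

```csharp
using System;

// Hypothetical stand-in for the Vector class, reduced to the tolerance check.
public sealed class TolerantVector
{
    private const double Tol = 1E-10; // same tolerance as TOL in the question
    private readonly double x, y, z;

    public TolerantVector(double x, double y, double z)
    {
        this.x = x; this.y = y; this.z = z;
    }

    // Component-wise comparison within the tolerance.
    public bool NearlyEquals(TolerantVector other)
    {
        return other != null
            && Math.Abs(x - other.x) <= Tol
            && Math.Abs(y - other.y) <= Tol
            && Math.Abs(z - other.z) <= Tol;
    }
}

public static class TransitivityDemo
{
    public static void Main()
    {
        var a = new TolerantVector(0, 0, 0);
        var b = new TolerantVector(0.9E-10, 0, 0);
        var c = new TolerantVector(1.8E-10, 0, 0);

        Console.WriteLine(a.NearlyEquals(b)); // True:  0.9E-10 apart
        Console.WriteLine(b.NearlyEquals(c)); // True:  0.9E-10 apart
        Console.WriteLine(a.NearlyEquals(c)); // False: 1.8E-10 apart
    }
}
```

Here `a` equals `b` and `b` equals `c`, yet `a` does not equal `c`, so any hash code consistent with this comparison would have to assign the same value to all three.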

When I have my set or mapping, I will collect vectors from another source (for the sake of this question let's assume I'll ask a user to type in a vector). These can be any possible vector. Those will never be added to the set/mapping, but I will need to know if they are contained in the set / key set of the mapping (regarding tolerance), and I will need to know their value from the mapping.

Kjara
  • 1
    You're going to need to provide more information about the object. Why can't you create a hash for it? What *can* you provide about it? How are you comparing the objects for equality, etc. – Servy Jul 26 '16 at 13:14
  • 1
    `SortedSet` and `SortedDictionary` – Ivan Stoev Jul 26 '16 at 13:17
  • @IvanStoev That assumes the objects have a consistent total ordering. – Servy Jul 26 '16 at 13:18
  • @Servy Sure. But I don't see other alternatives - they are either hash based (`IEqualityComparer`) and need `GetHashCode`, or ordered (`IComparer`) and need `Compare`. – Ivan Stoev Jul 26 '16 at 13:21
  • @IvanStoev Like I said earlier, it's going to depend on what, specifically, these objects are, and how they relate to each other, and also what he's trying to do with them. It's possible his objects cannot be represented just in a set, and need a different type of collection entirely. There are all sorts of possibilities, and no way to know what could work with the information provided. – Servy Jul 26 '16 at 13:23
  • @Servy Agreed. That's why I'm just commenting and not answering:) – Ivan Stoev Jul 26 '16 at 13:25
  • Is there documentation on which built-in classes and methods internally call `GetHashCode`? Could not find anything like this on the MSDN documentation website... – Kjara Jul 26 '16 at 13:40
  • 1
    @Kjara Any of the data structures that use the object's hash code will document that in *that data structure's documentation*. – Servy Jul 26 '16 at 13:44
  • @IvanStoev if you turn your comment into an answer, I'll accept it. – Kjara Jul 26 '16 at 14:28
  • @Kjara It's not a valid answer to the question, since you don't have a total ordering. – Servy Jul 26 '16 at 14:34
  • @Servy The lexicographic ordering (for x, y, z values which are `double`) should do the trick. Or am I missing something? (Don't forget I'm only going to build sets from "nice" vector collections, i.e. where `Equals` works as expected.) – Kjara Jul 26 '16 at 14:54
  • @Kjara Because of how you've defined equality you've put yourself in a position where A == B, B == C, but A > C. Those data structures assume that that can't be the case. You can put yourself in a position where a value you're searching for can have any number of matches, and you won't know which one you get back as the match. – Servy Jul 26 '16 at 14:59
  • @Servy Globally, you are right. But note that among the vectors in question (the ones I'm going to put in a set), `Equals` IS transitive. Which means, `A==B` and `B==C` implies `A==C`. It's the same with the lexicographic ordering: It works among the vectors in question, but it does not work globally. – Kjara Jul 26 '16 at 15:02
  • 1
    @Kjara But not when you combine it with the vector you're searching with. Even if the items in the collection are more than the threshold apart, they would need to be more than twice the threshold apart for there to never be any item you search on that matches two of them. – Servy Jul 26 '16 at 15:06
  • Good point! I have to check if this is the case in my application. – Kjara Jul 26 '16 at 15:13
  • 1
    Why do you want to put your Vectors into a HashSet or Dictionary in the first place? What is the goal you want to achieve? What would be the key to look up a Vector in a Dictionary? – wertzui Jul 27 '16 at 08:19
  • @wertzui the vector IS the key in my dictionary. The "why" takes a little longer to explain. I will probably do this at some point. – Kjara Aug 03 '16 at 19:34

2 Answers

3

You need a data structure that supports sorting, binary search and fast insertion. Unfortunately there is no such collection in the .NET Framework. The SortedDictionary doesn't support binary search, while the SortedList suffers from O(n) insertion for unsorted data. So you must look for a third-party tool. A good candidate seems to be the TreeDictionary of the C5 library. It is a red-black tree implementation that offers the important method RangeFromTo. Here is an incomplete implementation of a dictionary that has Vectors as keys, backed internally by a C5.TreeDictionary:

using System.Linq; // needed for Where and FirstOrDefault

public class VectorDictionary<TValue>
{
    private readonly C5.TreeDictionary<double, (Vector, TValue)> _tree =
        new C5.TreeDictionary<double, (Vector, TValue)>();

    public bool TryGetKeyValue(Vector key, out (Vector, TValue) pair)
    {
        double xyz = key.X + key.Y + key.Z;
        // Hoping that not all vectors are crowded in the same diagonal line
        var range = _tree.RangeFromTo(xyz - Vector.TOL * 3, xyz + Vector.TOL * 3);
        var equalPairs = range.Where(e => e.Value.Item1.Equals(key));
        // Selecting a vector from many "equal" vectors is tricky.
        // Some may be more equal than others. :-) Lets return the first for now.
        var selectedPair = equalPairs.FirstOrDefault().Value;
        pair = selectedPair;
        return selectedPair.Item1 != null;
    }

    public Vector GetExisting(Vector key)
    {
        return TryGetKeyValue(key, out var pair) ? pair.Item1 : default;
    }

    public bool Contains(Vector key) => TryGetKeyValue(key, out var _);

    public bool Add(Vector key, TValue value)
    {
        if (Contains(key)) return false;
        _tree.Add(key.X + key.Y + key.Z, (key, value));
        return true;
    }

    public TValue this[Vector key]
    {
        get => TryGetKeyValue(key, out var pair) ? pair.Item2 : default;
        set => _tree.Add(key.X + key.Y + key.Z, (key, value));
    }

    public int Count => _tree.Count;
}

Usage example:

var dictionary = new VectorDictionary<int>();
Console.WriteLine($"Added: {dictionary.Add(new Vector(0.5 * 1E-10, 0, 0), 1)}");
Console.WriteLine($"Added: {dictionary.Add(new Vector(0.6 * 1E-10, 0, 0), 2)}");
Console.WriteLine($"Added: {dictionary.Add(new Vector(1.6 * 1E-10, 0, 0), 3)}");
Console.WriteLine($"dictionary.Count: {dictionary.Count}");
Console.WriteLine($"dictionary.Contains: {dictionary.Contains(new Vector(2.5 * 1E-10, 0, 0))}");
Console.WriteLine($"dictionary.GetValue: {dictionary[new Vector(2.5 * 1E-10, 0, 0)]}");

Output:

Added: True
Added: False
Added: True
dictionary.Count: 2
dictionary.Contains: True
dictionary.GetValue: 3
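
If all keys are known before the first lookup, as the question describes, a sorted array with a hand-written binary search might be enough, with no third-party library. The following is only a sketch under that assumption, not a drop-in replacement for the class above: `SumIndex` and `LowerBound` are hypothetical names, vectors are represented as plain tuples to keep the example self-contained, and `Tol` is assumed to match `Vector.TOL`. It uses the same idea of indexing by `x + y + z` and then checking the candidates component by component.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical read-only index: built once from all known keys, then only queried.
public sealed class SumIndex<TValue>
{
    private const double Tol = 1E-10; // assumed to match Vector.TOL

    private readonly double[] _sums;                         // sorted x + y + z
    private readonly (double X, double Y, double Z)[] _keys; // same order as _sums
    private readonly TValue[] _values;                       // same order as _sums

    public SumIndex(IEnumerable<((double X, double Y, double Z) Key, TValue Value)> items)
    {
        var sorted = items.OrderBy(i => i.Key.X + i.Key.Y + i.Key.Z).ToArray();
        _sums = sorted.Select(i => i.Key.X + i.Key.Y + i.Key.Z).ToArray();
        _keys = sorted.Select(i => i.Key).ToArray();
        _values = sorted.Select(i => i.Value).ToArray();
    }

    public bool TryGetValue((double X, double Y, double Z) key, out TValue value)
    {
        double sum = key.X + key.Y + key.Z;
        // The scan window is a little wider than 3 * Tol to absorb rounding in the
        // sums (assumes coordinates of moderate magnitude). The exact component
        // check below decides whether a candidate really matches.
        double window = 4 * Tol;
        for (int i = LowerBound(sum - window); i < _sums.Length && _sums[i] <= sum + window; i++)
        {
            var k = _keys[i];
            if (Math.Abs(k.X - key.X) <= Tol &&
                Math.Abs(k.Y - key.Y) <= Tol &&
                Math.Abs(k.Z - key.Z) <= Tol)
            {
                value = _values[i]; // first match wins, as in TryGetKeyValue above
                return true;
            }
        }
        value = default(TValue);
        return false;
    }

    // Index of the first stored sum that is >= target (binary search).
    private int LowerBound(double target)
    {
        int lo = 0, hi = _sums.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (_sums[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}

public static class SumIndexDemo
{
    public static void Main()
    {
        var items = new List<((double X, double Y, double Z) Key, int Value)>
        {
            ((0.5E-10, 0.0, 0.0), 1),
            ((1.6E-10, 0.0, 0.0), 3),
        };
        var index = new SumIndex<int>(items);

        bool found = index.TryGetValue((2.5E-10, 0.0, 0.0), out int hit);
        Console.WriteLine(found); // True
        Console.WriteLine(hit);   // 3

        Console.WriteLine(index.TryGetValue((5E-10, 0.0, 0.0), out int miss)); // False
    }
}
```

Building the index costs one sort; each lookup is a binary search plus a scan over the few keys whose coordinate sum lies close to that of the query. The trade-off is that keys cannot be added afterwards without rebuilding the arrays.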
Theodor Zoulias
-2

You can get a reasonably good hashcode implementation in your case. Remember that the most important rule for a hash code is the following:

  • Two equal vectors must return the same value

This does not mean that two different vectors cannot return the same value. In some cases they obviously have to: the number of possible hashes is limited, while the number of distinct vectors is, for all practical purposes, not.

Well, with that in mind, simply compute your hash code from the vector's coordinates truncated to the tolerance's significant digits minus one. All equal vectors will give you the same hash, and a small minority of non-equal vectors that differ in the last decimal won't... you can live with that.

UPDATE: Changed rounded to truncated. Rounding is not the right choice.
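
As the comments below point out, any truncation scheme has boundaries, and two values within the tolerance of each other can land on either side of one. A minimal sketch of that failure for a single coordinate, assuming truncation to nine decimals (`Bucket` is a hypothetical helper written for this illustration, not an existing API):

```csharp
using System;

public static class TruncationHashDemo
{
    private const double Tol = 1E-10; // tolerance from the question

    // Hypothetical hash input: keep nine decimals, one digit fewer than the tolerance.
    private static long Bucket(double value) => (long)Math.Truncate(value * 1E9);

    public static void Main()
    {
        double a = 0.99E-9; // just below the truncation boundary at 1E-9
        double b = 1.01E-9; // just above it

        Console.WriteLine(Math.Abs(a - b) <= Tol); // True: equal within the tolerance
        Console.WriteLine(Bucket(a));              // 0
        Console.WriteLine(Bucket(b));              // 1
    }
}
```

The two coordinates compare as equal but fall into different buckets, so a hash built from these buckets would violate the contract with `Equals`.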

InBetween
  • This isn't the case. Rounding isn't enough. Let's say that you have a tolerance of 5 units. Next say I have an item with a value of 7. If you just round to the nearest multiple of 5, then 7 would have a hash of 5. Next let's look at 11. It would have a hash of 10, but it's "equal" to 7, and yet it has a different hash. Rounding simply can't work for tolerance based equality, because fundamentally there has to be a rounding point, so you can always pick two items on either side of the rounding point within the threshold of each other. – Servy Jul 26 '16 at 13:36
  • Nope, doesn't work. Because `Equals` is not transitive. Take vector (0,0,0) and (1E-10, 1E-10, 1E-10). They are equal according to my implementation. So they must return the same HashCode. Take vector (1E-10, 1E-10, 1E-10) and (2*1E-10, 2*1E-10, 2*1E-10). They are equal again, so they must return the same HashCode. And so on and so on. Thus, the only way to be consistent with the contract of `HashCode` and `Equals` is to implement `GetHashCode` as a constant function. – Kjara Jul 26 '16 at 13:37
  • @Servy: It still is a valid hash, it might not be a good one, and it obviously isn't the greater `TOL` is, but the whole point of this is that `TOL` is sufficiently small. Your example is not fair because you are basically using integral counterexamples which in the context of this question makes no sense – InBetween Jul 26 '16 at 13:39
  • @Kjara. I don't follow you. You are computing the hash of both with the equivalent coordinates of `(0,0,0)`. `1E-10` rounded to the ninth decimal is `0.0`, so is `2E-10`. Both would give you the same hash. – InBetween Jul 26 '16 at 13:40
  • @InBetween No, it's *not* a valid hash because objects that are equal are resolving to different hashes. That's not valid for a hash. I used a large tolerance to make the example easier to read; the logic applies regardless of the actual value of the tolerance. Also, it's not the size of the tolerance that affects how often your hash is invalid, it's the likelihood of any two objects being within the tolerance. If the tolerance is very small but the values are very close together, then you can have more failures than a data set with a larger tolerance but even more spread out values. – Servy Jul 26 '16 at 13:42
  • @InBetween And yet you can use 5E-9 +/- 5E12, which has two values less than the threshold rounding to different hashes. – Servy Jul 26 '16 at 13:45
  • @Servy No, I don't exactly agree. My example is based on rounding in powers of ten. My algorithm is flawed though, what I had in mind is to *truncate* one significant digit less than the tolerance, not rounding. In that case the algorithm stands although based upon the data it might not be very good. – InBetween Jul 26 '16 at 13:45
  • @InBetween Look at my argument. I showed that (0,0,0) and (2*1E-10,2*1E-10,2*1E-10) must have the same HashCode using the required transitivity of GetHashCode. Using the same argument again and again leads to the fact that (0,0,0) and (n*1E-10,n*1E-10,n*1E-10) must return the same HashCode - for ANY n. So (0,0,0) and (1000,1000,1000) must also have the same HashCode. And all other numbers as well. – Kjara Jul 26 '16 at 13:46
  • @InBetween you are obviously not understanding our arguments at all. There can't be a reasonable implementation of `GetHashCode` which is not constant for my `Equals` method. Servy and I both tried to explain, but you don't seem to understand the underlying logic. Maybe looking at wikipedia "Mathematical Induction" and "Transitive relation" helps. – Kjara Jul 26 '16 at 13:54
  • @InBetween If you truncate instead of rounding then you simply need the pairs of numbers to be just above and just below a number whose least significant digit is 0, not 5, the idea of how to break it still applies. Find the rounding/truncation point and get a pair of numbers just above and below it, that are within the threshold. No matter how you round, that algorithm will always break your hash. – Servy Jul 26 '16 at 13:54
  • @InBetween .9999999999999999999 and 1.00000000000000000001. The first hashes to 0.99, the latter hashes to 1.00, but they're *much* closer than 0.005 units. Hash broken. – Servy Jul 26 '16 at 13:59
  • @Ok got it, rounding is the same but around `.995`: `.994999999` and `.99500001`. Duh. – InBetween Jul 26 '16 at 14:00
  • @InBetween You see why I used bigger numbers the first time; it's much easier to see when you can do everything with 1 digit numbers, but the logic is the same. – Servy Jul 26 '16 at 14:04
  • My approach is different than yours. I do not start with an example implementation of `GetHashCode` and try to verify that it works. Instead, I only look at the requirements of `GetHashCode` (which is "if `A.Equals(B)`, then `A.GetHashCode() == B.GetHashCode()`") and my implementation of `Equals` and infer rules from it. I started with `(0,...).Equals((1E-10,...))` and inferred `(0,...).GetHashCode() == (1E-10,...).GetHashCode()`. Now I used mathematical induction to show that if `((n-1)*1E-10,...).Equals((n*1E-10,...))` then `((n-1)*1E-10,...).GetHashCode() == (n*1E-10,...).GetHashCode()`. – Kjara Jul 26 '16 at 14:10
  • Since the `==` operator on `int` (HashCodes are ints) IS transitive, it follows that for any number `n` you choose, `(0,0,0).GetHashCode() == (n*1E-10, n*1E-10, n*1E-10).GetHashCode()`. This does not hold for YOUR implementation, but it HAS TO HOLD for an implementation of `GetHashCode` if you want to fulfill the contract with `Equals`. – Kjara Jul 26 '16 at 14:13
  • @Kjara That said, I *do* agree that there seems to be no way to implement a correct hashcode and that my answer is obviously wrong. I just don't understand how you are trying to prove the fact. – InBetween Jul 27 '16 at 06:29
  • @InBetween Your counterexample is not a counterexample. Just put the two vectors `(100*1E-10,...)` and `(101*1E-10,...)` into my `Equals` method. The difference in each entry is exactly `1E-10`, which is `TOL`, so `Equals` returns true. In my reasoning, I did not use transitivity of `Equals` at any point. If you still think I did, please tell me where by quoting the corresponding extract of my explanations. – Kjara Aug 03 '16 at 19:26