I'm trying to implement an IEqualityComparer<string> that treats two strings x and y as equal if x starts with y or y starts with x.

public bool Equals(string x, string y)
{
    return x.StartsWith(y) || y.StartsWith(x);
}

public int GetHashCode(string obj)
{
    return obj.GetHashCode();
}

Of course, implementing the Equals method is pretty easy. But GetHashCode is not; I couldn't think of any way to implement it correctly. I have written a test program like this:

string[] values = {"hell", "hello", "foo", "fooooo"};

var result = values.Distinct(new StringComparer());

foreach(var x in result)
   Console.WriteLine(x);

And I get the wrong result because of GetHashCode:

hell
hello
foo
fooooo

Obviously I can force Equals to be called by returning the same value from GetHashCode for all values, but I want to know if there is another way to implement it because performance is critical. Is there a way to implement the GetHashCode method correctly for my situation?

Note: I know it is vague but I couldn't find a better title, if you have a better idea you are free to edit.


Edit: I'm going to use this logic with web URLs. In my situation the first 20 characters are equal. For example:

http://www.foo.com/bar?id=3
http://www.foo.com/bar?id=3&fooId=23
Selman Genç
  • Since `"f", "fo", "foo", "fooo"` are considered equal to each other, I don't see any better solution than to return the first character (`'f'`) as a hash code. – Dmitry Bychenko Aug 27 '14 at 10:05
  • Given your use case, the best thing is to not treat them as strings. Parse them and then decide which parts of the URL must match and which need not. For example, I assume `http://www.foo.com/bar?id=3&fooId=23` and `http://www.foo.com/bar?fooId=23&id=3` should be considered equal, which they won't be by your mechanism. If you say the main URL (i.e. the part before the `?`) must match, and then say which query string parameters are relevant (e.g. only id), you can base both your hash and your equality on those parts, and it will be more robust (see also chiccodoro's answer for why the current approach is not robust). – Chris Aug 27 '14 at 10:27
  • The title of your question was strong enough to attract the right people :-) – chiccodoro Aug 27 '14 at 10:27
  • As for your question, the issue starts with your definition of `Equals`, not with `GetHashCode`. See my answer. Furthermore, given your added use case explanation, @Chris' comment might be an important hint. – chiccodoro Aug 27 '14 at 10:28
  • @chiccodoro: We all love a good hashcode question. ;-) – Chris Aug 27 '14 at 10:28
  • I should note I am happy to expand on my above comment but I don't feel it answers the question as asked which is why I didn't make it an answer. Hopefully from the rough outline you can expand it yourself if you want to though. :) – Chris Aug 27 '14 at 10:32
  • `Edit: I'm going to use this logic with web urls` A good example of an [XY Problem](http://www.perlmonks.org/?node=xy+problem) – EZI Aug 27 '14 at 12:38

2 Answers


The issue is in your definition of equality: Equality must be transitive. But it is not in your case. Take the following three values:

* f
* freeze
* foo

Then f == freeze, and foo == f, but freeze != foo.
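The three comparisons above can be checked directly. This is a minimal sketch: `PrefixEquals` is a hypothetical helper that restates the question's rule, and the use of `StringComparison.Ordinal` is an assumption (the original code uses the culture-sensitive default).

```csharp
using System;

static class PrefixEqualityDemo
{
    // Hypothetical helper mirroring the question's Equals rule.
    // Ordinal comparison is an assumption, not part of the original code.
    public static bool PrefixEquals(string x, string y) =>
        x.StartsWith(y, StringComparison.Ordinal) ||
        y.StartsWith(x, StringComparison.Ordinal);

    static void Main()
    {
        Console.WriteLine(PrefixEquals("f", "freeze"));   // True
        Console.WriteLine(PrefixEquals("foo", "f"));      // True
        Console.WriteLine(PrefixEquals("freeze", "foo")); // False: transitivity is broken
    }
}
```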

See also MSDN on Implementing the Equals Method, which says:

(x.Equals(y) && y.Equals(z)) returns true if and only if x.Equals(z) returns true.

A proper definition of equality partitions the values into disjoint sets of values that are considered equal. If you had those, you could define a "canonical" representation for each set and calculate the hash of the canonical value, so each set would have its own hash code. But this only works with a relation that is transitive (as well as symmetric and reflexive; these two properties are covered by your definition).

Since your definition of equality is not transitive, you cannot define such sets, so you can't find a proper hash code either.
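For contrast, here is a rough sketch of what a canonical-representation comparer could look like once equality is transitive. Everything in it is an assumption for illustration: the class name `CanonicalKeyComparer`, and especially the rule that the canonical form is the text before the first `&`. It does not handle null inputs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: two strings are equal when their canonical keys match.
sealed class CanonicalKeyComparer : IEqualityComparer<string>
{
    // Assumed canonical form: everything before the first '&'.
    public static string Key(string s)
    {
        int i = s.IndexOf('&');
        return i < 0 ? s : s.Substring(0, i);
    }

    // Equality and hashing both go through the same key,
    // so equal strings always produce equal hash codes.
    public bool Equals(string x, string y) =>
        string.Equals(Key(x), Key(y), StringComparison.Ordinal);

    public int GetHashCode(string obj) =>
        StringComparer.Ordinal.GetHashCode(Key(obj));
}

static class CanonicalKeyDemo
{
    static void Main()
    {
        string[] urls =
        {
            "http://www.foo.com/bar?id=3",
            "http://www.foo.com/bar?id=3&fooId=23",
            "http://www.foo.com/bar?id=4"
        };

        // The first two URLs share a key; the third forms its own group.
        var groups = urls.GroupBy(u => u, new CanonicalKeyComparer());
        Console.WriteLine(groups.Count()); // 2
    }
}
```

Because every string maps to exactly one key, the relation is transitive by construction. Note that this is a different rule from the prefix test: under it, `...?id=3` and `...?id=30` are not equal.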

But it raises other questions, too. Taking your example:

string[] values = { "hell", "hello", "foo", "fooooo" };
var result = values.Distinct(new StringComparer());

Which values do you expect to go into your result? Do you always want the shortest version? Your code does not guarantee this; the result will depend on the internal implementation of Distinct.

Implementing an EqualityComparer might possibly be a sub-optimal approach to your actual issue. What are you trying to achieve?

chiccodoro
  • I want the longest but I thought it shouldn't matter. :) Does my edit answer your question _What are you trying to achieve?_ – Selman Genç Aug 27 '14 at 10:33
  • My first guess would be that `Distinct` keeps the first out of a set of equal values, but as I said, it is an implementation detail. If it was true it would happen to yield `hell` and `foo`. – chiccodoro Aug 27 '14 at 10:36
  • As for your edit: It gives another example but your intention is not yet clear to me. Why do you want to match URLs like this? – chiccodoro Aug 27 '14 at 10:36
  • 1) Don't bother with Distinct; I can get the longest using `values.GroupBy(x => x, new StringComparer()).Select(g => g.MaxBy(x => x.Length))`. 2) This is not relevant to my question, but if you want to know, it is because of a web site that was horribly designed :) Consider this: if I go to `www.foo.com/productId=2` I can see only the name and code of a product, but at `www.foo.com/productId=2&categoryId=23` I see other information like the price. That was the real reason. – Selman Genç Aug 27 '14 at 10:41
  • @Selman22 - absolutely - it is not relevant for this question. So literally my answer to your question is what I already wrote concerning your definition of `Equals`. I just wanted to go a step further and find what intention might lead to your idea of defining equality like this. Knowing the conceptual background can sometimes help give more precise answers for what the asker is really after. As @Chris already pointed out, your definition of equality might have issues for your case, too. E.g. `/bar/pId=2&catId=23` will not equal `/bar/catId=23&pId=2` even though for your web app it is. – chiccodoro Aug 27 '14 at 12:04

Because whether two strings are equal depends on which string you compare them with, any string can end up equal to another (the empty string, for example, is a prefix of every string). Thus the only way to implement the GetHashCode method is to return the same value for all strings:

public int GetHashCode(string obj) {
  return 0;
}

This will naturally give a horrible distribution. A dictionary will have O(n) lookup time instead of O(1), but it works, and it's the only way to make it work for such an equality comparison.
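As a rough sketch, this is how the constant hash behaves with the sample data from the question. The names `PrefixComparer` and `DistinctByPrefix` are made up for this example (a different class name also avoids clashing with the framework's `System.StringComparer`), and the ordinal comparison is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer combining the question's Equals rule with a constant hash.
sealed class PrefixComparer : IEqualityComparer<string>
{
    // Ordinal comparison is an assumption; the question uses the default overload.
    public bool Equals(string x, string y) =>
        x.StartsWith(y, StringComparison.Ordinal) ||
        y.StartsWith(x, StringComparison.Ordinal);

    // Constant hash: every value lands in the same bucket,
    // so each lookup falls through to Equals.
    public int GetHashCode(string obj) => 0;
}

static class ConstantHashDemo
{
    public static List<string> DistinctByPrefix(IEnumerable<string> values) =>
        values.Distinct(new PrefixComparer()).ToList();

    static void Main()
    {
        string[] values = { "hell", "hello", "foo", "fooooo" };
        foreach (var x in DistinctByPrefix(values))
            Console.WriteLine(x);
    }
}
```

With this input only two values remain, one per prefix family. Which member of each family is kept depends on how Distinct is implemented, and because the relation is not transitive, other inputs can give results that depend on element order.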

Guffa
  • Your answer is entirely correct. Although two equal strings must produce the same hash code, two strings with the same hashcode don't necessarily have to be equal. However, as you point out, it is not really a useful solution. The reason is that equality is not properly defined (see also my answer). – chiccodoro Aug 27 '14 at 10:24