
It's a requirement for any comparison sort to work that the underlying order operator is transitive and antisymmetric.

In .NET, that's not true for some strings:

static void CompareBug()
{
  string x = "\u002D\u30A2";  // or just "-ア" if charset allows
  string y = "\u3042";        // or just "あ" if charset allows

  Console.WriteLine(x.CompareTo(y));  // positive one
  Console.WriteLine(y.CompareTo(x));  // positive one
  Console.WriteLine(StringComparer.InvariantCulture.Compare(x, y));  // positive one
  Console.WriteLine(StringComparer.InvariantCulture.Compare(y, x));  // positive one

  var ja = StringComparer.Create(new CultureInfo("ja-JP", false), false);
  Console.WriteLine(ja.Compare(x, y));  // positive one
  Console.WriteLine(ja.Compare(y, x));  // positive one
}

You see that x is strictly greater than y, and y is strictly greater than x.

Because x.CompareTo(x) and so on all give zero (0), it is clear that this is not an order. Not surprisingly, I get unpredictable results when I Sort arrays or lists containing strings like x and y. Though I haven't tested this, I'm sure SortedDictionary<string, WhatEver> will have problems keeping itself in sorted order and/or locating items if strings like x and y are used for keys.
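
To make the failure concrete, a small helper along these lines can test the antisymmetry requirement for a given pair of strings. This is only a sketch: `ComparerChecks` and `IsAntisymmetricFor` are names made up for illustration, not framework APIs.

```csharp
using System;
using System.Collections.Generic;

static class ComparerChecks
{
    // Hypothetical helper: returns true when cmp reports opposite signs
    // for (a, b) and (b, a), or zero for both.
    public static bool IsAntisymmetricFor(IComparer<string> cmp, string a, string b)
    {
        return Math.Sign(cmp.Compare(a, b)) == -Math.Sign(cmp.Compare(b, a));
    }
}
```

With `StringComparer.Ordinal` the check passes for `x` and `y`. Given the output shown earlier, it would be expected to fail for `StringComparer.InvariantCulture` on the affected framework version.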

Is this bug well-known? What versions of the framework are affected (I'm trying this with .NET 4.0)?

EDIT:

Here's an example where the sign is negative either way:

x = "\u4E00\u30A0";         // equiv: "一゠"
y = "\u4E00\u002D\u0041";   // equiv: "一-A"
Jeppe Stig Nielsen
  • See also [this question](http://stackoverflow.com/questions/11467424/somestring-indexofsomestring-returns-1-instead-of-0-under-net-4) on string comparison fun when the string contains a hyphen. See what happens under the CLR of .NET 3.5, which calls different Win32 API functions for string comparison. – CodeCaster Nov 06 '12 at 15:29
  • I have found .NET does not quite implement the full Unicode spec. Especially around casing, I have seen a few limitations (or bugs if you prefer to read it that way). – leppie Nov 06 '12 at 15:32
  • I can't repro this (via IronScheme) for either .NET 2 or .NET 4, but that is good news, as I might be doing something different. Will check what I do: http://eval.ironscheme.net/?id=72. Edit: OK, mine was different due to doing `Ordinal` comparison. Will this work for you perhaps? – leppie Nov 06 '12 at 15:43
  • Note that 'ア' and 'あ' are two different symbols for the same Japanese syllable 'a'. One would think the sort order would be consistent, though, regardless of whether Mr Hyphen gets involved or not. – Paul Ruane Nov 06 '12 at 15:51
  • @leppie You probably can't repro it because string comparison on the CLR calls Win32 APIs and I guess IronScheme's DLR doesn't. – CodeCaster Nov 06 '12 at 16:00
  • Japanese sorting is based on pronunciation. Problem is, a character can have multiple pronunciations. You need a yomigana support library to get this right. – Hans Passant Nov 06 '12 at 17:00
  • @leppie Yes, the bug doesn't reproduce with `Ordinal`. `Ordinal` is more or less just treating each `char` as its corresponding number, and doing "lexicographical" comparison on the resulting lists of numbers, so that's really hard to get wrong. `Ordinal` is also faster. But the default (`Comparer.Default`) has this error, and we can't expect everyone to switch to `Ordinal` comparison (which is rarely useful if you're sorting text strings). – Jeppe Stig Nielsen Nov 06 '12 at 17:03
  • Is there any final solution for this? – Kiquenet Jun 19 '13 at 11:23

2 Answers


If correct sorting is that important for your problem, just use ordinal string comparison instead of culture-sensitive comparison. Only ordinal comparison guarantees the transitive and antisymmetric ordering you want.

What MSDN says:

Specifying the StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase value in a method call signifies a non-linguistic comparison in which the features of natural languages are ignored. Methods that are invoked with these StringComparison values base string operation decisions on simple byte comparisons instead of casing or equivalence tables that are parameterized by culture. In most cases, this approach best fits the intended interpretation of strings while making code faster and more reliable.

And it works as expected:

    Console.WriteLine(String.Compare(x, y, StringComparison.Ordinal));  // -12309
    Console.WriteLine(String.Compare(y, x, StringComparison.Ordinal));  // 12309
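
The same idea should carry over to sorted collections, which accept a comparer in their constructor. A minimal sketch follows, using the two strings from the question; `OrdinalDemo` and `Build` are illustrative names only.

```csharp
using System;
using System.Collections.Generic;

static class OrdinalDemo
{
    // Builds a sorted dictionary whose keys are ordered by UTF-16 code unit
    // values instead of culture-sensitive rules.
    public static SortedDictionary<string, int> Build()
    {
        var map = new SortedDictionary<string, int>(StringComparer.Ordinal);
        map["\u002D\u30A2"] = 1;  // "-ア"
        map["\u3042"] = 2;        // "あ"
        return map;
    }
}
```

Both keys can then be looked up again with `ContainsKey`, because the comparer used for insertion and lookup is a consistent total order.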

Admittedly, this doesn't explain why culture-sensitive comparison gives inconsistent results. Well, strange culture, strange result.

shuribot
  • Ordinal comparison can be a relevant option in some cases, but it doesn't change the fact that it's not the default, and it doesn't legitimize errors in the implementation of the other comparison types. Strange culture? I agree that **`InvariantCulture`** is a strange culture, but this problem happens in all .NET cultures. If you're referring to Japanese culture, I don't think there's anything in Japanese culture that claims that something can be simultaneously greater than and less than something else. – Jeppe Stig Nielsen Nov 06 '12 at 22:39
  • Of course, you're right. I was joking. You could really try submitting this bug to [Microsoft](http://connect.microsoft.com/VisualStudio). – shuribot Nov 07 '12 at 08:13
  • And how do you sort strings for the UI? Sorting with a bogus comparer is not guaranteed to succeed (see source code for BCL sort algorithm). Any (sane) workarounds? – usr Nov 18 '12 at 21:34

I came across this SO post while trying to figure out why I was having problems retrieving (string) keys that had been inserted into a SortedList. I had discovered that the cause was the odd behaviour of the comparers in .NET 4.0 and above (a1 < a2 and a2 < a3, but a1 > a3).

My struggle to figure out what was going on can be found here: c# SortedList<string, TValue>.ContainsKey for successfully added key returns false.

You may want to have a look at the "UPDATE 3" section of my SO question. It appears that the issue was reported to Microsoft in December 2012 and closed before the end of January 2013 as "won't be fixed". Additionally, it lists a workaround that may be used.

I created an implementation of this recommended workaround, and verified that it fixed the problem that I had encountered. I also just verified that this resolves the issue you reported.

public static void SO_13254153_Question()
{
    string x = "\u002D\u30A2";  // or just "-ア" if charset allows
    string y = "\u3042";        // or just "あ" if charset allows        

    var invariantComparer = new WorkAroundStringComparer();
    var japaneseComparer = new WorkAroundStringComparer(new System.Globalization.CultureInfo("ja-JP", false));
    Console.WriteLine(x.CompareTo(y));  // positive one
    Console.WriteLine(y.CompareTo(x));  // positive one
    Console.WriteLine(invariantComparer.Compare(x, y));  // negative one
    Console.WriteLine(invariantComparer.Compare(y, x));  // positive one
    Console.WriteLine(japaneseComparer.Compare(x, y));  // negative one
    Console.WriteLine(japaneseComparer.Compare(y, x));  // positive one
}
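
The `WorkAroundStringComparer` class itself is not reproduced in this answer. As a rough illustration only, and assuming the workaround amounts to comparing culture sort keys instead of the strings directly (my assumption here, not a quotation of Microsoft's text), a comparer of that kind could look like the sketch below. `SortKeyStringComparer` is a hypothetical name, and it is not claimed to match the class used above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical sketch: orders strings by their culture sort keys.
// Sort keys are byte sequences, so comparing them yields a consistent order.
sealed class SortKeyStringComparer : IComparer<string>
{
    private readonly CompareInfo _compareInfo;
    private readonly CompareOptions _options;

    public SortKeyStringComparer(CultureInfo culture = null,
                                 CompareOptions options = CompareOptions.None)
    {
        // Falls back to the invariant culture when none is given.
        _compareInfo = (culture ?? CultureInfo.InvariantCulture).CompareInfo;
        _options = options;
    }

    public int Compare(string x, string y)
    {
        if (ReferenceEquals(x, y)) return 0;
        if (x == null) return -1;  // nulls sort first
        if (y == null) return 1;

        // A new sort key is built on every call, which is where the cost comes from.
        SortKey keyX = _compareInfo.GetSortKey(x, _options);
        SortKey keyY = _compareInfo.GetSortKey(y, _options);
        return SortKey.Compare(keyX, keyY);
    }
}
```

Whatever order the sort keys give for a particular pair, comparing `(x, y)` and `(y, x)` can no longer both come out positive.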

The remaining problem is that this workaround is so slow it is hardly practical for use with large collections of strings. So I hope Microsoft will reconsider closing this issue or that someone knows of a better workaround.
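
For one-off sorting of an array, part of that cost might be avoided by computing each sort key once up front and sorting on the keys. I have not measured this, so treat it as a sketch under assumptions: `SortKeySorting` and `SortBySortKey` are made-up names, the input is assumed to contain no null entries, and `Comparer<T>.Create` requires .NET 4.5 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class SortKeySorting
{
    // Hypothetical helper: sorts items in place by precomputed culture sort keys.
    public static void SortBySortKey(string[] items, CultureInfo culture)
    {
        CompareInfo compareInfo = culture.CompareInfo;

        // Compute one sort key per string instead of two per comparison.
        var keys = new SortKey[items.Length];
        for (int i = 0; i < items.Length; i++)
        {
            keys[i] = compareInfo.GetSortKey(items[i]);
        }

        // Array.Sort reorders items to follow the sorted order of keys.
        Array.Sort(keys, items, Comparer<SortKey>.Create(SortKey.Compare));
    }
}
```

This does not help collections such as SortedList, which call the comparer on every insertion and lookup.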

Alex
  • This was about a case where comparing `x` to `y` gave an answer inconsistent with comparing `y` to `x`. Now I found a case (also with hyphens, but ASCII only) where `x.CompareTo(y)` is consistent with `y.CompareTo(x)`, however there are three strings `a`, `b` and `c` that violate transitivity, i.e. they act as "rock", "paper", "scissors" in the well-known game. Again it is fixed with your `WorkAroundStringComparer`. The transitivity problem (just like the antisymmetry problem of this thread) was not present in .NET 3.5 (2008). See [my new thread](http://stackoverflow.com/questions/23087995/). – Jeppe Stig Nielsen Apr 17 '14 at 12:19