I have the following C# code, compiled as Sort.exe:

using System;
using System.Collections.Generic;

class Test
{
    public static int Main(string[] args)
    {
        string text = null;
        List<string> lines = new List<string>();
        while((text = Console.In.ReadLine()) != null)
        {
            lines.Add(text);
        }

        lines.Sort();

        foreach(var line in lines)
            Console.WriteLine(line);

        return 0;
    }
}

I have a file, input.txt, that contains the following 5 lines:

x000000000000000000093.000000000
x000000000000000000037.000000000
x000000000000000100000.000000000
x000000000000000000538.000000000
x-00000000000000000020.000000000

If I run it from the command prompt, this is the output:

C:\Users\girijesh\AppData\Local\Temp>sort < input.txt
x000000000000000000037.000000000
x000000000000000000093.000000000
x-00000000000000000020.000000000
x000000000000000000538.000000000
x000000000000000100000.000000000

I can't work out what kind of string sorting places the string starting with x- (3rd line of the output) in the middle of the strings starting with x0. I would have expected that line to be either at the top or at the bottom. Excel shows the same behaviour.

Sayse
  • So, are you asking about the internal works of sorting strings? Or how to sort it so that the list is shown like the first one? – Nahuel Ianni May 14 '14 at 13:16
  • What happens when you use `lines = lines.OrderBy(line => line).ToList();` ? – Kamil T May 14 '14 at 13:17
  • @Nahuel: he is asking why the standard sort function seems to give incorrect results. – Adrian Ratnapala May 14 '14 at 13:18
  • Your question would be better (IMO) if you got rid of the file handling part and just populated the `List` inline. I have such code myself now, as it was easier to test - would you be happy for me to update the question? – Jon Skeet May 14 '14 at 13:18
  • See [this question](http://stackoverflow.com/questions/23087995/string-comparison-and-sorting-when-strings-contain-hyphens). Basically, hyphens are treated as "ignorable". – Damien_The_Unbeliever May 14 '14 at 13:18
  • You will need `StringComparison.Ordinal` to fix it. – Sriram Sakthivel May 14 '14 at 13:19
  • Sounds like Damien has the answer. Gah! All these clever locale-aware functions are sometimes more of a menace than a help. – Adrian Ratnapala May 14 '14 at 13:20
  • Interestingly, in LINQPad on my machine, the `x-` line is *last*. – Bobson May 14 '14 at 13:23
  • Related: [SortedList/SortedDictionary weird behavior](http://stackoverflow.com/questions/19370734/sortedlist-sorteddictionary-weird-behavior). Possibly a duplicate. – Sriram Sakthivel May 14 '14 at 13:23
  • @JonSkeet sir, that was only to give a reproducible way to show the problem. The problem also exists in the SQL Server ORDER BY clause. If a column contains those 5 values and we order by that column, the result is the same as above, and there I don't have the option of StringComparer.Ordinal. – user2176811 May 14 '14 at 13:56
  • @user2176811: You don't need a file (or console input) to reproduce the problem though. Just populating the `List` with `var list = new List<string> { "...", "...", "...", "...", "..." };` using the same values reproduces the problem much more simply. – Jon Skeet May 14 '14 at 13:57

1 Answer

In many cultures (including the invariant culture), the hyphen is a character of only minor importance for sorting purposes. In most texts this makes sense: pre-whatever and prewhatever are pretty similar. For example, the following list sorts like this under those rules, which I think is sensible:

preasdf
prewhatever
pre-whatever
prezxcv

You seem to want an ordinal comparison, where strings are compared purely by the numeric values of their characters (UTF-16 code units), with no culture-specific rules. If you change the sort call to:

lines.Sort(StringComparer.Ordinal);

Then your results are:

x-00000000000000000020.000000000
x000000000000000000037.000000000
x000000000000000000093.000000000
x000000000000000000538.000000000
x000000000000000100000.000000000

If you're wondering why the -...20.0 value ended up where it did in your original output, consider what the list would look like if you removed the - (and compare with the pre list above):

x000000000000000000037.000000000
x000000000000000000093.000000000
x00000000000000000020.000000000
x000000000000000000538.000000000
x000000000000000100000.000000000
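
As a rough illustration of that idea, here is a minimal sketch of a comparer that ignores hyphens first and only uses them to break ties. This is an approximation for simple ASCII examples like the ones here, not the actual collation algorithm that .NET uses, and the `HyphenModel` class and its `Compare` method are hypothetical names made up for this example.

```csharp
using System;
using System.Collections.Generic;

static class HyphenModel
{
    // Hypothetical helper: approximates "hyphens matter less" ordering.
    // It is NOT the real culture-sensitive comparison, only a simple model of it.
    public static int Compare(string a, string b)
    {
        string strippedA = a.Replace("-", "");
        string strippedB = b.Replace("-", "");

        // First compare the strings as if they contained no hyphens.
        int result = string.CompareOrdinal(strippedA, strippedB);
        if (result != 0)
            return result;

        // On a tie, the string with more hyphens sorts later.
        int hyphensA = a.Length - strippedA.Length;
        int hyphensB = b.Length - strippedB.Length;
        return hyphensA.CompareTo(hyphensB);
    }

    public static void Main()
    {
        var lines = new List<string> { "prezxcv", "pre-whatever", "preasdf", "prewhatever" };
        lines.Sort(Compare);

        foreach (var line in lines)
            Console.WriteLine(line);
    }
}
```

With this model, `prewhatever` and `pre-whatever` end up next to each other, hyphenated form second, between the other two entries. The real culture-sensitive result can differ between .NET versions and platforms, so treat this only as a way to reason about the output.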

If your input is always in the format x[some number], I'd parse the value after x as a decimal or double and sort on that. That makes the expected behavior easier to guarantee, because the order then depends on the numeric value rather than on string comparison rules.
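
A minimal sketch of that approach, assuming every line is exactly the letter x followed by a plain decimal number with an optional leading minus sign. The `NumericSort` class and its `ParseValue` and `SortByValue` methods are hypothetical names for this example, and malformed lines would make `decimal.Parse` throw.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

static class NumericSort
{
    // Hypothetical helper: assumes the line is "x" followed by a decimal number.
    public static decimal ParseValue(string line)
    {
        return decimal.Parse(
            line.Substring(1), // drop the leading 'x'
            NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint,
            CultureInfo.InvariantCulture); // '.' is always the decimal separator
    }

    // Returns the lines ordered by their numeric value, smallest first.
    public static List<string> SortByValue(IEnumerable<string> lines)
    {
        return lines.OrderBy(line => ParseValue(line)).ToList();
    }

    public static void Main()
    {
        var lines = new List<string>
        {
            "x000000000000000000093.000000000",
            "x000000000000000000037.000000000",
            "x000000000000000100000.000000000",
            "x000000000000000000538.000000000",
            "x-00000000000000000020.000000000",
        };

        foreach (var line in SortByValue(lines))
            Console.WriteLine(line);
    }
}
```

Parsing with `CultureInfo.InvariantCulture` keeps the result independent of the machine's regional settings, which is the same concern that caused the surprising string order in the first place.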

Tim S.
  • So the `-` character is not considered, right? I mean, without `ordinal`, the string `x-00000000000000000020` is the same as `x00000000000000000020`? – Rahul May 14 '14 at 13:28
  • @Rahul That's what I thought at first, but it's not quite right (that was for the soft hyphen, not the regular hyphen). I've amended my answer to clarify how hyphens are treated. They are of lower importance, but not none. – Tim S. May 14 '14 at 13:37
  • Nothing wrong with your answer, and it explains things pretty well, but I am still confused about how sort is handling `-` in this case. I will have to find out. – Rahul May 14 '14 at 13:40
  • I think it's equivalent to this (in the relatively simple examples here): first, compare the strings without the hyphens included. If the strings are different, use that comparison as the answer; if they're the same, then the one with the hyphen goes after the one without. This is why `prewhatever` and `pre-whatever` end up next to each other, and in between the other two. – Tim S. May 14 '14 at 13:44
  • Makes total sense; I think sort is considering [a-z][A-Z][0-9] first and then going for special characters like -, /, ; etc. – Rahul May 14 '14 at 13:46
  • Is there anything like StringComparer.Ordinal for SQL Server ORDER BY or Excel sorting as well, since this problem is present there too? – user2176811 May 14 '14 at 13:52
  • @user2176811, post that as a separate question so it stays isolated, since these are two different technologies. – Rahul May 14 '14 at 13:55