1

Following up on https://stackoverflow.com/a/32233795/14731, I was surprised to discover that:

DECLARE @SampleData TABLE (ANSI VARCHAR(50), UTF16 NVARCHAR(50));
INSERT INTO @SampleData (ANSI, UTF16) VALUES 
    ('##MS_PolicyTsqlExecutionLogin##', N'##MS_PolicyTsqlExecutionLogin##'),
    ('_gaia', N'_gaia');

SELECT sd.ANSI AS [ANSI-SQL_Latin1_General_CP1_CI_AS]
FROM   @SampleData sd
ORDER BY sd.ANSI COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

SELECT sd.UTF16 AS [UTF16-SQL_Latin1_General_CP1_CI_AS]
FROM   @SampleData sd
ORDER BY sd.UTF16 COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

Results in:

ANSI-SQL_Latin1_General_CP1_CI_AS
-------------------------------------
##MS_PolicyTsqlExecutionLogin##
_gaia

UTF16-SQL_Latin1_General_CP1_CI_AS
-------------------------------------
##MS_PolicyTsqlExecutionLogin##
_gaia

When, according to "Why doesn't ICU4J match UTF-8 sort order?", the Unicode results are supposed to be in the opposite order. Why is this the case?

Community
  • 1
  • 1
Gili
  • 86,244
  • 97
  • 390
  • 689

2 Answers2

1

First things first: the linked question -- Why doesn't ICU4J match UTF-8 sort order? -- hasn't been shown to be entirely correct yet ;-).

That related info aside, let's look at the various pieces:

  1. VARCHAR field with COLLATE SQL_Latin1_General_CP1_CI_AS:

    This is going to sort primarily based on ASCII values, and in the case of alphabetic characters, will sort and compare based on rules defined in Code Page 1 (a.k.a. Code Page 1252).

    The # character is ASCII code 35 while the _ character is ASCII code 95. These are not alphabetic characters so one should assume that they would sort with the # coming first when doing an ASCending order, as you are doing here.

  2. NVARCHAR field with COLLATE SQL_Latin1_General_CP1_CI_AS:

    This is going to sort according to Unicode rules. There are no Code Pages in Unicode, BUT there could be cultural differences that override the default sort rules and ordering. AND, to make things even more interesting, both the base rules and culture/locale -specific overrides can (and do) change over the years. Software vendors are not always that quick to implement new versions of standards. This is no different than various browsers implementing different W3C specifics at different points in time. The major updates in SQL Server came with version 2008 which introduced the 100 series of collations. SQL Server 2012 introduced variants of the 90 and 100 series, ending in _SC, to handle supplementary characters (i.e. the rest of the UTF-16 characters beyond the UCS-2 set).

    Going back to something mentioned a moment ago, each locale / culture can specify overrides of any of the rules (and not just sorting rules). The current version, 28 (released just 4 days ago!!), has the following for the US locale (found at: http://www.unicode.org/repos/cldr/tags/release-27/common/collation/en_US_POSIX.xml )

    <collation type="standard">
      <cr>
      <![CDATA[
        &A<*'\u0020'-'/'<*0-'@'<*ABCDEFGHIJKLMNOPQRSTUVWXYZ<*'['-'`'<*abcdefghijklmnopqrstuvwxyz <*'{'-'\u007F'
      ]]>
      </cr>
    </collation> 
    

    Reading the new syntax isn't super-easy, but I don't think they are reordering any of these punctuation characters. And if you go to their Collation Charts and click on the 4 link down (starting at the top, left), for "Punctuation", it certainly does list "_" as coming before all but one character.

    If we go back a few versions, we find (found at: http://www.unicode.org/repos/cldr/tags/release-23/common/collation/en_US_POSIX.xml ):

    <collation type="standard">
      <rules>
        <reset>A</reset>
        <pc>!"#$%&'()*+,-./</pc>
        <pc>0123456789:;<=>?@</pc>
        <pc>ABCDEFGHIJKLMNOPQRSTUVWXYZ</pc>
        <pc>[\]^_`</pc>
        <pc>abcdefghijklmnopqrstuvwxyz</pc>
        <pc>{|}~</pc>
      </rules>
    </collation> 
    

    Now here it does certainly look like they reordered it, and in the same order as the ASCII values?

    If you change the URL to point to version 24, that will look just like the current version 28 XML.

    According to the release dates found here CLDR Releases/Downloads, version 24 came out in 2013, well after the 100 series of collations were coded.

Community
  • 1
  • 1
Solomon Rutzky
  • 46,688
  • 9
  • 128
  • 171
  • Do you have a source for the claim that SQL Server uses the UCA or CLDR rules? You link to the POSIX rules, but why would Windows use these rules for Latin1 (which is a different character set)? Also, the two POSIX orders you give are the same, they just [updated the syntax](http://unicode.org/cldr/trac/ticket/5551). – 一二三 Sep 22 '15 at 05:38
  • Nice theory but I don't think it's actually correct. It turns out that versioned collation charts are published at http://cldr.unicode.org/index/charts. Comparing http://www.unicode.org/cldr/charts/23/by_type/patterns.characters.html#6cf943e652b01478 to http://www.unicode.org/cldr/charts/28/by_type/core_data.alphabetic_information.html#6cf943e652b01478 it looks like the sorting order (for the two characters in question) is identical. So either SQL Server is using an older version or something else must be going on. – Gili Sep 22 '15 at 08:39
  • @Gili It's very unclear what Microsoft means by "Unicode Collation" rules. Another thing to keep in mind is that the UCA rules are designed to be customised. – 一二三 Sep 22 '15 at 08:51
  • @Gili I am short on time now, but: look more closely at those two collation charts and you will notice that there is no underscore ;-). Do a Crtl-F in your browser to test it. Not there for the "en" locale. And as I have been saying, SQL Server _is_ using an old version. I never said v23 was being used, just that I saw at least a notation change starting in v24. But looking at the release dates, if the newest collations in Windows that SQL Server can use (the 100 series) came out in 2008, then the CLDR version had to already exist, which could be v1.4 from 2006 or v1.5 from 2007. More later.. – Solomon Rutzky Sep 22 '15 at 17:24
  • @一二三 I am short on time, but: while not definitive proof (I will look for better), there is an indication here at the end of the remarks section: [SortKey Class](https://msdn.microsoft.com/en-us/library/system.globalization.sortkey.aspx). Do you have a source for your claim? ;-) I found 2 mentions of it, but they were in 2004 and 2005. Also, I am only looking into the Unicode handling now, hence only a single character set. – Solomon Rutzky Sep 22 '15 at 17:47
  • @srutzky I tried older versions of icu4j, going back as far as version 3.4.4 (CLDR 1.2) and the collation was still the opposite of the SQL case so I really think something else is going on. Let me know once you find out more. – Gili Sep 22 '15 at 17:58
0

It turns out that @一二三 is right about SQL Server not implementing the default Unicode Collation Algorithm rules, but he was wrong about it using a codepage for unicode sorting. https://stackoverflow.com/a/32706510/14731 contains a detailed explanation of how unicode sorting is really implemented.

Community
  • 1
  • 1
Gili
  • 86,244
  • 97
  • 390
  • 689