Numeric sort order for strings containing same-length integers under all collations

Question

Is it safe to assume that all collations in SQL Server will give the "expected" (i.e. numeric) sort order on strings containing integers of the same length? For example, assuming that @text contains just non-negative integers ([0-9]+), would the following snippet work for ensuring that the value does not overflow the int range, or might there be some collation where @text <= '2147483647' gives unexpected results?

IF LEN(@text) BETWEEN 1 AND 9 
OR LEN(@text) = 10 AND @text <= '2147483647'
-- ...

There won't be cases of variable-length comparisons (e.g. '2' < '11'), so please do not address that issue.

Why invent a new way to safe convert to integer? Use TRY_CONVERT on SQL2012+ or ISNUMERIC. i.e. `CASE WHEN ISNUMERIC(@text + '.e0') = 1 THEN CASE WHEN CONVERT(bigint, @text) BETWEEN -2147483648 AND 2147483647 THEN CONVERT(int, @text) END END` — adrianm, Jun 17 '14 at 07:41
@adrianm: I had to devise [my own logic](http://stackoverflow.com/a/24250511/1149773) because none of the online solutions I came across could handle all the border cases (and I need it to be compatible with SQL 2008). Yours fails for numbers that overflow the `bigint` type (e.g. `9999999999999999999`). Which might seem unrealistic, until you realize that [BBANs](http://www.swift.com/dsp/resources/documents/IBAN_Registry.pdf) can consist of 24 digits. — Douglas, Jun 17 '14 at 16:23
Convert to `float` instead of `bigint`. That should handle up to 300+ digits. — adrianm, Jun 18 '14 at 06:40
Yes, that would work (and `ISNUMERIC` seems to protect against float overflows). However, this discussion is outside the scope of the present question, and would be better-suited at http://stackoverflow.com/q/2000045/1149773. I still would like to know whether, for example, `'123' < '321'` is guaranteed to evaluate to `true` under all collations. — Douglas, Jun 18 '14 at 18:43
I am with Linoff. I don't think are going to find hard documentation that all collations sort numbers the same because a collation is not required to. It seems reasonable they all will. I really think the answer you need to create some test cases and test them against all collations. There are only like 150. — paparazzo, Jun 20 '14 at 20:30
A counter-example will be a `COLLATION` where the number are not ordered usual way, I greatly hope that there is no such a thing — Serpiton, Jun 20 '14 at 21:46

score 3 · Answer 1 · answered Jun 20 '14 at 03:05

SQL Server collations do not guarantee anything about the encodings. They are mappings from binary representations of characters to the commonly understood characters.

For this purpose, I think the relevant concept is "code page":

Code Page

A code page is an ordered set of characters of a given script in which a numeric index, or code point value, is associated with each character. A Windows code page is commonly referred to as a character set or charset. Code pages are used to provide support for the character sets and keyboard layouts that are used by different Windows system locales. All Windows Server 2008 Unicode collations are Unicode 5.0-based.

The "ordered" piece is the part of interest for this problem. The ordering determines whether the characters are "naturally" ordered for numbers.

So, the concept of collations does not requires that numbers be naturally ordered.

That said, I am not aware of any collation anywhere where the digits '0' - '9' are not ordered naturally. In the Unicode standard, numbers are ordered naturally. I cannot think of a reason why anyone would create such a collation. So, in practice, I would be very, very surprised if such a collation existed. And, if it did, it would probably not be Unicode-compliant and so would not be available in SQL Server.

In practice, no such collation exists or at least no such collation is in wide-spread use. Natural ordering is not guaranteed by the definition of a collation, but is part of the Unicode character sets. You are pretty safe in assuming that all collations have natural ordering of numbers, but it is theoretically possible to create a character set with non-natural ordering of digits.

I agree. You can get all the documentation about how things are now, and the assumptions they use, but you are effectively relying on a side-effect, albeit a deliberate one. This would be a very dangerous solution as the side-effect's reliance extends off into the future. — Phil Wallach, Jun 21 '14 at 08:36
I'm sorry, but your statements about code pages seem to be factually incorrect. The order of the characters in the code page only corresponds to the collation's sort order for *binary* collations. Non-binary collations define a sort order that can be entirely different from the the underlying code page (e.g. for case-insensitivity). — Douglas, Jun 21 '14 at 15:39
As for Unicode, I'm aware that the code points for the digits `0`–`9` are sequential (`U+0030`–`U+0039`); however, that does not, in itself, prevent a collation from altering their sort order. For example, the code point for `B` (`U+0042`) is smaller than `a` (`U+0061`); however, `'B' < 'a'` evaluates to `false` for non-binary collations. — Douglas, Jun 21 '14 at 15:43
@Douglas Exactly nothing prevents a collation from altering. And yet you think you are going to get some "authoritative" affirmation from SO that none do. — paparazzo, Jun 21 '14 at 22:46
@Blam: You have a bad habit of misinterpreting what I say. I said that *code points* do not mandate the sort order, not that nothing does. The authoritative answer for Unicode collations probably lies in [UTS #10](http://www.unicode.org/reports/tr10/#Main_Algorithm), which I'd hoped someone might be familiar with. — Douglas, Jun 21 '14 at 23:33

score 2 · Answer 2 · answered Jun 25 '14 at 08:13

2

ORDER BY CASE WHEN ISNUMERIC(YourColumn) = 1 THEN Cast(YourColumn as Decimal(38,0)) ELSE null End

answered Jun 25 '14 at 08:13

mehdi lotfi

11,194
18
82
128

paparazzo · Answer 3 · 2014-06-21T16:02:55.607

Like I said in my comment I don't think you are going to find hard documentation that all collations sort numbers the same as that is not a requirement of the collation. But it would be reasonable to assume/expect that. I cannot think of a single use case where I would not expect numbers to sort the same.

Kind of grunge but test them all
The first select just creates the selects
Then copy paste to run the output as selects
There is probably a way to do this in loop I could not get it to use collate to a variable

None failed this simple '54321' < '22222' or N'22222' < N'22221' test for me

SELECT 'SELECT ''' + name + ''' where  ''54321'' < ''22222'' or N''22222'' < N''22221'' COLLATE ' + name FROM sys.fn_helpcollations() WHERE name NOT LIKE 'SQL%';

set nocount on 
SELECT 'Albanian_BIN' where  '54321' < '22222' or N'22222' < N'22221' COLLATE Albanian_BIN
SELECT 'Albanian_BIN2' where  '54321' < '22222' or N'22222' < N'22221' COLLATE Albanian_BIN2
SELECT 'Albanian_CI_AI' where  '54321' < '22222' or N'22222' < N'22221' COLLATE Albanian_CI_AI
SELECT 'Albanian_CI_AI_WS' where  '54321' < '22222' or N'22222' < N'22221' COLLATE Albanian_CI_AI_WS
SELECT 'Albanian_CI_AI_KS' where  '54321' < '22222' or N'22222' < N'22221' COLLATE Albanian_CI_AI_KS
SELECT 'Albanian_CI_AI_KS_WS' where  '54321' < '22222' or N'22222' < N'22221' COLLATE Albanian_CI_AI_KS_WS
SELECT 'Albanian_CI_AS' where  '54321' < '22222' or N'22222' < N'22221' COLLATE Albanian_CI_AS

Response to your comment

Unfortunately, an exhaustive test would be unfeasible, not due to the number of collations, but due to the number of string comparisons that need to be made for each collation. Collations permit groups of characters to be considered together (e.g. “Mac” and “Mc”). Thus, '11' < '22' need not give the same result as '12' < '22'. In my case, I'd need to run 10^10 tests for each collation, as I'm dealing with 10-digit strings.

What does "Mac" and "Mc" have and to do with this? The state quesion is "There won't be cases of variable-length comparisons (e.g. '2' < '11'), so please do not address that issue."
You feel you need to test all 10 digit strings combinations?
So if you tested all 4 digit combinations you are afraid two digits in position 7 and 8 of 10 might behave differently? Really if you are that concerned then even if you found documentation that all collations must sort numeric the same that does not mean they implemented properly.
If you are that sensitive to accuracy then why is convert to numeric out of scope? I get it your question and you bounty but may ask what the use case?

What I would do is a random order table of all 5 digits and random order table of all 4 digit (so you have odd and even). Have each collation sort and compare to known good sort.
That would run in 4-8 hours on even a mid range server. This question has a 5 day bounty.

Unfortunately, an exhaustive test would be unfeasible, not due to the number of collations, but due to the number of string comparisons that need to be made for each collation. Collations permit groups of characters to be considered together (e.g. [“Mac” and “Mc”](http://en.wikipedia.org/wiki/Mac_and_Mc_together)). Thus, `'11' < '22'` need not give the same result as `'12' < '22'`. In my case, I'd need to run 10^10 tests for each collation, as I'm dealing with 10-digit strings. — Douglas, Jun 21 '14 at 14:42
Updated answer and what does "Mac" and "Mc" have to do with this. You state only digit and give explicit instructions NOT to consider variable length. — paparazzo, Jun 21 '14 at 16:11
OK but 4 digit test is not single digit. I don't mean this negative but if your are that sensitive then why is converting to numeric out of scope? — paparazzo, Jun 21 '14 at 16:49
I mentioned “Mac” and “Mc” as “groups of characters”, not as strings by themselves. I can't demonstrate the issue since SQL Server doesn't have an Irish collation, but consider the treatment of the “ch” digraph under Czech collations. `'c' < 'd'` gives true (as expected); however, `'ch' < 'dh'` gives false, since the two-character combination `ch` is treated as a single letter that goes between `h` and `i`. Try it yourself: `SELECT CASE WHEN N'ch' < N'dh' COLLATE SQL_Czech_Cp1250_CS_AS THEN 1 ELSE 0 END` — Douglas, Jun 21 '14 at 17:00
Thus, unless there is a restriction on the number of characters that may be considered together (for sorting) by a collation, then there is no guarantee that `'11111' < '22222'` implies `'1111111111' < '2222222222'`. — Douglas, Jun 21 '14 at 17:02
P.S. I'm definitely not going to award the bounty to a question that recommends an exhaustive test for this problem. I made it clear in my bounty comments that I'm looking for "an authoritative answer that draws on official documentation". — Douglas, Jun 21 '14 at 17:10
Good for you don't award. I bet I get some up votes You do know M A and C are not digits. Really no guarantee that '11111' < '22222' implies '1111111111' < '2222222222'. If you have that little confidence then why would not want exhaustive test. I am just having jun with you. I hope you get what you you looking for. — paparazzo, Jun 21 '14 at 17:12

russ · Answer 4 · 2014-06-26T17:27:20.540

1

Is it safe to assume that all collations in SQL Server will give the "expected" (i.e. numeric) sort order on strings containing integers of the same length? = yes

...or might there be some collation where @text <= '2147483647' gives unexpected results? =no

edited Jun 26 '14 at 17:27

answered Jun 26 '14 at 17:21

russ

579
3
7

Numeric sort order for strings containing same-length integers under all collations

4 Answers4

Linked