0

I'm wondering which collation should I use to get a consistent sort for a nvarchar column containing text and emojis. In any case the expected result is something like this (sorted by MS Excel):

enter image description here

First try: SQL_Latin1_General_CP1_CI_AI

SELECT Val
FROM (VALUES
    (N'⭐⭐⭐'),
    (N'⭐⭐⭐'),
    (N''),
    (N'⭐⭐'),
    (N'⭐⭐'),
    (N''),
    (N'⭐'),
    (N'❗❗'),
    (N'❗❗'),
    (N'❗'),
    (N'❗'),
    (N'❗❗❗'),
    (N'❗❗❗'),
    (N'bb'),
    (N'ab'),
    (N'aa')
) AS A (Val)
ORDER BY Val COLLATE SQL_Latin1_General_CP1_CI_AI

Result (not as excpected):

enter image description here

Second try (based on this answer https://stackoverflow.com/a/47551803/2336493): Latin1_General_100_CI_AS_SC

SELECT Val
FROM (VALUES
    (N'⭐⭐⭐'),
    (N'⭐⭐⭐'),
    (N''),
    (N'⭐⭐'),
    (N'⭐⭐'),
    (N''),
    (N'⭐'),
    (N'❗❗'),
    (N'❗❗'),
    (N'❗'),
    (N'❗'),
    (N'❗❗❗'),
    (N'❗❗❗'),
    (N'bb'),
    (N'ab'),
    (N'aa')
) AS A (Val)
ORDER BY Val COLLATE Latin1_General_100_CI_AS_SC;

Result (not as excpected):

enter image description here

Has anyone dealt with this?

gsubiran
  • 2,012
  • 1
  • 22
  • 33
  • 1
    Why do you assume this is the correct order? Why would ASCII strings be sorted after *any* emoji? `aa`-`bb` should *always* come before any text starting with an emoji. In any case, no language has emojis, so no normal collation should be expected to sort them. – Panagiotis Kanavos Sep 08 '21 at 16:34
  • @PanagiotisKanavos I agree with you but this order was the most accurate I found at the moment of writing the question. – gsubiran Sep 08 '21 at 16:43
  • 1
    It's not accurate at all. ASCII letters can't come after emojis that were added 40 years later. Excel is using its own rules in this case, perhaps treating all values as numbers through some strange transformation – Panagiotis Kanavos Sep 08 '21 at 17:04
  • When I try these emojis in Excel for Mac the emojis *always* come before letters and after numbers. They also come before digits treated as text, ie `'0`, `'1`, `'2`. The flags *don't* appear after the letters. So Excel *does* treat emojis as numbers but sorts them *after* actual numbers. That's specific to Excel, and probably done because Excel uses such glyphs as indicators in dashboards – Panagiotis Kanavos Sep 08 '21 at 17:13
  • I tried with some more emojis and Excel's order isn't very consistent, but it does suggest emojis are treated as "numbers". I posted the results – Panagiotis Kanavos Sep 08 '21 at 17:26

2 Answers2

2

Try a binary collation. EG

SELECT Val
FROM (VALUES
    (N'⭐⭐⭐'),
    (N'⭐⭐⭐'),
    (N''),
    (N'⭐⭐'),
    (N'⭐⭐'),
    (N''),
    (N'⭐'),
    (N'❗❗'),
    (N'❗❗'),
    (N'❗'),
    (N'❗'),
    (N'❗❗❗'),
    (N'❗❗❗'),
    (N'bb'),
    (N'ab'),
    (N'aa')
) AS A (Val)
ORDER BY Val COLLATE Latin1_General_100_BIN2

outputs

Val
----
aa
ab
bb
❗
❗
❗❗
❗❗
❗❗❗
❗❗❗
⭐
⭐⭐
⭐⭐
⭐⭐⭐
⭐⭐⭐



(16 rows affected)
David Browne - Microsoft
  • 80,331
  • 6
  • 39
  • 67
1

It makes no sense to order aa-bb after any string that starts with an emoji. ASCII letters like a and b come first in all collations.

Emojis aren't part of any language so normal collations aren't expected to sort them in any specific way. You need to use a binary collation instead (_BIN2) which sorts characters according to their codepoint values. For example :

SELECT Val
FROM (VALUES
    (N'⭐⭐⭐'),
    (N'⭐⭐⭐'),
    (N''),
    (N'⭐⭐'),
    (N'⭐⭐'),
    (N''),
    (N'⭐'),
    (N'❗❗'),
    (N'❗❗'),
    (N'❗'),
    (N'❗'),
    (N'❗❗❗'),
    (N'❗❗❗'),
    (N'bb'),
    (N'αα'),
    (N'ab'),
    (N'aa')
) AS A (Val)
ORDER BY Val COLLATE Greek_BIN2;

And

ORDER BY Val COLLATE Latin1_General_100_BIN2;

Produce

aa
ab
bb
αα
❗
❗
❗❗
❗❗
❗❗❗
❗❗❗
⭐
⭐⭐
⭐⭐
⭐⭐⭐
⭐⭐⭐


Why is Excel doing ?

It looks like Excel is treating those glyphs as numbers but sorts them after actual numbers. Text with digits is sorted after emojis too.

I suspect the emojis are treated as numbers because Excel uses such glyphs as indicators in dashboards.

The order isn't very consistent though, with emojis that appear later in the Unicode table getting sorted before stars and flags. In Excel For Mac, ordering in ascending order produced this image :

enter image description here

Panagiotis Kanavos
  • 120,703
  • 13
  • 188
  • 236
  • A binary collation in fact *only* sorts by code point; any language collation rules are ignored (compare `select * from (values (N'rôle'), (N'roles')) _(x) ORDER BY x COLLATE French_BIN2/French_CI_AS`). The difference between binary collations of different languages is that they also determine how strings are stored in `[VAR]CHAR` types, but for Unicode types, all binary collations are very nearly equivalent. (Nearly because they still influence things like lowercasing and uppercasing, which can vary between languages -- but not sorting.) – Jeroen Mostert Sep 08 '21 at 17:02
  • @JeroenMostert oops. – Panagiotis Kanavos Sep 08 '21 at 17:13
  • I found that both collations you said are excellent to sort emoji characters. `Latin1_General_100_BIN2` is also case and accent sensitive. `Greek_BIN2` is case sensitive but accent insensitive. I need both insensitive so `Greek_BIN2` is the closest solution for my case. – gsubiran Sep 08 '21 at 19:09
  • 1
    @gsubiran: `Greek_BIN2` is not accent insensitive. If you are using Unicode, there's no difference. If you are not using Unicode, you get a different order because Greek has no encoding for accented Latin characters (among other things, notably including emoji), so these get converted *before* ordering. This may superficially look like accent insensitivity but it's quite a different thing! [Demo](https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=2b2e895bebe4cf898f855d6d2dbc28bd). Relying on how SQL Server converts unrepresentable characters is not a good idea. – Jeroen Mostert Sep 09 '21 at 14:11
  • 1
    @gsubiran I used `Greek_BIN2` simply as an example that the language doesn't matter when it comes to emojis. Any `_BIN2` should behave the same **provided you use Unicode*. Otherwise you won't be able to store emojis at all. The single-byte codepages simply don't have space for emojis and the the emojis will almost certainly be replaced by `?`. – Panagiotis Kanavos Sep 09 '21 at 14:16