14

The Hawaiian quote has some weird behavior in T-SQL when using it in conjunction with string functions. What's going on here? Am I missing something? Do other characters suffer from this same problem?

SELECT UNICODE(N'ʻ') -- Returns 699 as expected.

SELECT REPLACE(N'"ʻ', '"', '_') -- Returns "ʻ, I expected _ʻ

SELECT REPLACE(N'aʻ', 'a', '_') -- Returns aʻ, I expected _ʻ

SELECT REPLACE(N'"ʻ', N'ʻ', '_') -- Returns __, I expected "_

SELECT REPLACE(N'-', N'ʻ', '_') -- Returns -, I expected -

Also, strange when used in a LIKE for example:

DECLARE @table TABLE ([Name] NVARCHAR(MAX))
INSERT INTO
    @table
VALUES
    ('John'),
    ('Jane')

SELECT
    *
FROM
    @table
WHERE
    [Name] LIKE N'%ʻ%' -- This returns both records. I expected none.
Solomon Rutzky
  • 46,688
  • 9
  • 128
  • 171
Vqf5mG96cSTT
  • 2,561
  • 3
  • 22
  • 41
  • Excellent question that helps illustrate the shortcomings of Unicode *and* the history behind them. This should be part of the Computer History Museum! – Panagiotis Kanavos Apr 08 '19 at 07:43
  • @PanagiotisKanavos I don't mean to drag this out, but I don't see how this is a shortcoming of Unicode. Unicode provides the ability to more accurately capture a lot of the complexity of languages across a huge spectrum of variations. Microsoft's implementation of it might not be as good as ICU, but it's far better than ignoring that some languages have markings that alter the meaning and/or behavior of the character(s) before and/or after them. I would argue that this question illustrates the amazing accomplishment of Unicode (and yes, also the huge learning curve that comes with it ). – Solomon Rutzky May 08 '19 at 21:44

2 Answers2

10

The Hawaiian quote has some weird behavior in T-SQL when using it in conjunction with string functions. ... Do other characters suffer from this same problem?

A few things:

  1. This is not a Hawaiian "quote": it's a "glottal stop" which affects pronunciation.
  2. It is not "weird" behavior: it's just not what you were expecting.
  3. This behavior is not specifically a "problem", though yes, there are other characters that exhibit similar behavior. For example, the following character (U+02DA Ring Above) behaves slightly differently depending on which side of a character it is on:

    SELECT REPLACE(N'a˚aa' COLLATE Latin1_General_100_CI_AS, N'˚a',  N'_'); -- Returns a_a
    SELECT REPLACE(N'a˚aa' COLLATE Latin1_General_100_CI_AS, N'a˚',  N'_'); -- Returns _aa
    

Now, anyone using SQL Server 2008 or newer should be using a 100 (or newer) level collation. They added a lot of sort weights and uppercase/lowercase mappings in the 100 series that aren't in the 90 series, or the non-numbered series, or the mostly obsolete SQL Server collations (those with names starting with SQL_).

The issue here is not that it doesn't equate to any other character (outside of a binary collation), and in fact it actually does equate to one other character (U+0312 Combining Turned Comma Above):

;WITH nums AS
(
  SELECT TOP (65536) (ROW_NUMBER() OVER (ORDER BY @@MICROSOFTVERSION) - 1) AS [num]
  FROM   [master].sys.all_columns ac1
  CROSS JOIN   [master].sys.all_columns ac2
)
SELECT nums.[num] AS [INTvalue],
       CONVERT(BINARY(2), nums.[num]) AS [BINvalue],
       NCHAR(nums.[num]) AS [Character]
FROM   nums
WHERE  NCHAR(nums.[num]) = NCHAR(0x02BB) COLLATE Latin1_General_100_CI_AS;
/*
INTvalue    BINvalue    Character
699         0x02BB      ʻ
786         0x0312      ̒
*/

The issue is that this is a "spacing modifier" character, and so it attaches to, and modifies the meaning / pronunciation of, the character before or after it, depending on which modifier character you are dealing with.

According to the Unicode Standard, Chapter 7 (Europe-I), Section 7.8 (Modifier Letters), Page 323 (of the document, not of the PDF):

7.8 Modifier Letters

Modifier letters, in the sense used in the Unicode Standard, are letters or symbols that are typically written adjacent to other letters and which modify their usage in some way. They are not formally combining marks (gc = Mn or gc = Mc) and do not graphically combine with the base letter that they modify. They are base characters in their own right. The sense in which they modify other letters is more a matter of their semantics in usage; they often tend to function as if they were diacritics, indicating a change in pronunciation of a letter, or otherwise distinguishing a letter’s use. Typically this diacritic modification applies to the character preceding the modifier letter, but modifier letters may sometimes modify a following character. Occasionally a modifier letter may simply stand alone representing its own sound.
...

Spacing Modifier Letters: U+02B0–U+02FF

Phonetic Usage. The majority of the modifier letters in this block are phonetic modifiers, including the characters required for coverage of the International Phonetic Alphabet. In many cases, modifier letters are used to indicate that the pronunciation of an adjacent letter is different in some way—hence the name “modifier.” They are also used to mark stress or tone, or may simply represent their own sound.

 
The examples below should help illustrate. I am using a level 100 collation, and it needs to be accent-sensitive (i.e. name contains _AS):

SELECT REPLACE(N'ʻ'    COLLATE Latin1_General_100_CI_AS, N'ʻ',   N'_'); -- Returns _
SELECT REPLACE(N'ʻa'   COLLATE Latin1_General_100_CI_AS, N'ʻ',   N'_'); -- Returns _a
SELECT REPLACE(N'ʻaa'  COLLATE Latin1_General_100_CI_AS, N'ʻ',   N'_'); -- Returns _aa
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻ',   N'_'); -- Returns __aa

SELECT REPLACE(N'ʻaa'  COLLATE Latin1_General_100_CI_AS, N'ʻa',  N'_'); -- Returns ʻ__
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻa',  N'_'); -- Returns aʻ__

SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'aʻ',  N'_'); -- Returns _aa
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'aʻa', N'_'); -- Returns _a

SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'a',   N'_'); -- Returns aʻ__
SELECT REPLACE(N'אʻaa' COLLATE Latin1_General_100_CI_AS, N'א',   N'_'); -- Returns אʻaa
SELECT REPLACE(N'ffʻaa' COLLATE Latin1_General_100_CI_AS, N'ff',   N'_'); -- Returns ffʻaa
SELECT REPLACE(N'ffaa'  COLLATE Latin1_General_100_CI_AS, N'ff',   N'_'); -- Returns _aa



SELECT CHARINDEX(N'a', N'aʻa' COLLATE Latin1_General_100_CI_AS); -- 3
SELECT CHARINDEX(N'a', N'aʻa' COLLATE Latin1_General_100_CI_AI); -- 1



SELECT 1 WHERE N'a' = N'aʻ' COLLATE Latin1_General_100_CI_AS; -- (0 rows returned)
SELECT 2 WHERE N'a' = N'aʻ' COLLATE Latin1_General_100_CI_AI; -- 2

If you need to deal with such characters in a way that ignores their intended linguistic behavior, then yes, you must use a binary collation. In such cases, please use the most recent level of collation, and BIN2 instead of BIN (assuming you are using SQL Server 2005 or newer). Meaning:

  • SQL Server 2000: Latin1_General_BIN
  • SQL Server 2005: Latin1_General_BIN2
  • SQL Server 2008, 2008 R2, 2012, 2014, and 2016: Latin1_General_100_BIN2
  • SQL Server 2017 and newer: Japanese_XJIS_140_BIN2

If you are curious why I make that recommendation, please see:

Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2)

And, for more information on collations / Unicode / encodings / etc, please visit: Collations Info

Solomon Rutzky
  • 46,688
  • 9
  • 128
  • 171
  • 1
    Fantastically thorough answer. Wish I could up vote twice. – HoneyBadger Apr 04 '19 at 20:37
  • 1
    A real multilingual planeswalker for sure. I'm still baffled by your recommendation to use `Japanese_XJIS_140_BIN2` for 2017+. I'm not seeing a reference to that in your (excellent) article. – TT. Apr 05 '19 at 03:41
  • 1
    @TT. Hi, and thanks :). The reasoning is in the 2nd section, titled "**Different Versions**". Basically, there are 211 more uppercase / lowercase mappings than the version 100 collations. And, it's only the Japanese collations that have a version 140 set, else I would go with `Latin1_General_140_*` if such a thing existed... – Solomon Rutzky Apr 05 '19 at 04:41
  • @HoneyBadger Thanks! I appreciate the sentiment :-). – Solomon Rutzky Apr 05 '19 at 05:56
2

I cannot provide a detailed answer, but i can provide a solution to fulfill your expectations.

This has to do with collations, though I'm not sure why the Windows collations give unexpected results. If you use a binary collation, you get expected results (see Solomons excellent answer for which BIN to use):

SELECT REPLACE(N'aʻ' COLLATE Latin1_General_BIN, N'a', N'_') 

Returns

DECLARE @table TABLE ([Name] NVARCHAR(MAX))
INSERT INTO
    @table
VALUES
    (N'John'),
    (N'Jane'),
    (N'Hawaiʻi'),
    (N'Hawai''i'),
    (NCHAR(699))

SELECT
    *
FROM
    @table
WHERE
    [Name] like N'%ʻ%' COLLATE Latin1_General_BIN

Returns:

Hawaiʻi
ʻ

You can check which collation confirms your expectations with the following code (Adapted from code by @SolomonRutzky (source)). It evaluates SELECT REPLACE(N'"ʻ', N'ʻ', N'_')) = '"_' for all collations:

DECLARE @SQL NVARCHAR(MAX) = N'DECLARE @Counter INT = 1;';

SELECT @SQL += REPLACE(N'
  IF((SELECT REPLACE(N''"ʻ'' COLLATE {Name}, N''ʻ'', N''_'')) = ''"_'')
  BEGIN
    RAISERROR(N''%4d.  {Name}'', 10, 1, @Counter) WITH NOWAIT;
    SET @Counter += 1;
  END;
', N'{Name}', col.[name]) + NCHAR(13) + NCHAR(10)
FROM   sys.fn_helpcollations() col
ORDER BY col.[name]

--PRINT @SQL;
EXEC (@SQL);
HoneyBadger
  • 14,750
  • 3
  • 34
  • 48
  • 1
    [This should give a hint](https://geoffrichards.co.uk/blog/2013/08/hawaiian-collation/) , plus the [misuse of the character](https://en.wikipedia.org/wiki/%CA%BBOkina#Unicode) for transliterations of other languages. Only binary sort orders recognize this character as a separate character – Panagiotis Kanavos Apr 01 '19 at 14:18
  • 1
    @PanagiotisKanavos , regarding "_Only binary sort orders recognize this character as a separate character_": that is not true. Nor is it the issue. Unicode allows for some character to behave differently depending on context, so testing them individually won't show how they truly behave. Please see [my answer](https://stackoverflow.com/a/55524337/577765) for details. Also, HoneyBadger, please see the note at the end of my answer regarding which binary collation to use :-). Take care... – Solomon Rutzky Apr 04 '19 at 20:30
  • @SolomonRutzky I used *your* SQL script to find that only binary collations recognize the character - *any* binary collation, irrespective of language. *No* non-binary collation was returned, so according to that script, only binary collations will work – Panagiotis Kanavos Apr 05 '19 at 07:40
  • @PanagiotisKanavos Please be respectful enough to not be snippy (I wasn't rude to you), and to read my answer before arguing about what I said. It explains what's going on with this character and why HoneyBadger's adaptation of my query only addresses one aspect of this situation. I assume you mean the query in this answer, as the one in my answer shows what I am talking about. And, if saying "_irrespective of language_" means you think it's possible that culture could apply to binary comparisons of Unicode, then please read the post linked at the end of my answer to understand why it's not. – Solomon Rutzky Apr 05 '19 at 17:16
  • @SolomonRutzky I wasn't snippy and simply replied with what I found. The collations that recognized the character were *all* the BINx collations. Nice script by the way, it can save a *lot* of time in similar questions. As for the actual issue, the first link shows [that the character is considered significant in Hawaiian, although it wasn't in the past](https://geoffrichards.co.uk/blog/2013/08/hawaiian-collation/). The [Wikipedia article](https://en.wikipedia.org/wiki/%CA%BBOkina#Unicode) doesn't contradict your answer, it explains the confusion that lead to this and other similar questions – Panagiotis Kanavos Apr 08 '19 at 07:35
  • @SolomonRutzky your answer explains why the existing collations behave the way they do, even though .NET for example doesn't. This *is* a grey area, any way you look at it. There's no Hawaiian collation in SQL Server, no Hawaiian culture in .NET, [not even an ISO two-letter code](https://simple.wikipedia.org/wiki/Hawaiian_language). – Panagiotis Kanavos Apr 08 '19 at 07:38
  • @SolomonRutzky long story short, this, along with the Turkish `I` should be taught at college – Panagiotis Kanavos Apr 08 '19 at 09:42
  • @PanagiotisKanavos Sorry for the delayed response. To clarify: 1) there is no need to specify "_all_ BINx collations" since they are all identical in this respect. 2) I did review those links when this was first asked. Thanks for those. 3) .NET by default uses ordinal comparisons (i.e. binary), but that can be changed. 4) No Hawaiin collation in SQL Server, but .NET definitely has it as of Windows 8 / Windows Server 2012 ( https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/a9eac961-e77d-41a6-90a5-ce1a8b0cdb9c ). ISO code "haw" for LCID 117, and "haw-US" for LCID 1141. – Solomon Rutzky May 08 '19 at 20:43
  • Yes, this should be taught. But what should be taught is basic elements of languages and cultures in general, because the variations go well beyond this 'Okina character. Case in point: the "Turkish" `İ` that you mention isn't even that simple. There is a dotless lowercase version `ı`, and both dotted-uppercase and dotless lowercase are also found in Azerbaijani / Azeri. I have updated my "Differences Between the Various Binary Collations" post with this info. This is why I am writing articles and presenting on this topic, and might even have a self-published book out some time next year :). – Solomon Rutzky May 08 '19 at 20:43