3

Overview

CHARINDEX is returning the wrong value occasionally when using a collation sequence like:

Latin1_General_CI_AS 

but works with a collation sequence like:

SQL_Latin1_General_CP1_CI_AS

This has been encountered on MS SQL Server 2008 R2 and SQL Server 2016.

Examples

Assume the database collation sequence is:

Latin1_General_CI_AS
  • print CHARINDEX( CHAR(254), 'Tþ' ) -- returns 2 which is Correct
  • print CHARINDEX( CHAR(254), 'Th' ) -- returns 1 which is WRONG
  • print CHARINDEX( CHAR(253), 'Th' ) -- returns 0 which is Correct
  • print CHARINDEX( CHAR(254) Collate SQL_Latin1_General_CP1_CI_AS, 'Thþ' Collate SQL_Latin1_General_CP1_CI_AS) -- returns 3 which is Correct
  • print CHARINDEX( CHAR(254) Collate Latin1_General_CI_AS, 'Thþ' Collate Latin1_General_CI_AS) -- returns 1 which is WRONG

Is there a known error with the Latin1... collation sequences?

Martin Smith
  • 438,706
  • 87
  • 741
  • 845
Gazza
  • 41
  • 5

2 Answers2

2

This will return the correct result:

select CHARINDEX( NCHAR(254) Collate Latin1_General_BIN2, N'Tþ' Collate Latin1_General_BIN2)
select CHARINDEX( NCHAR(254) Collate Latin1_General_BIN2, N'Th' Collate Latin1_General_BIN2 )
select CHARINDEX( NCHAR(253) Collate Latin1_General_BIN2, N'Th' Collate Latin1_General_BIN2 )
select CHARINDEX( NCHAR(254) Collate Latin1_General_BIN2, N'Thþ' Collate Latin1_General_BIN2)

Documentation says:

Using Binary Collations

The following considerations will help you to decide whether old or new binary collations are appropriate for your Microsoft SQL Server implementation. Support for both BIN and BIN2 collations will continue in future SQL Server releases.

Binary collations sort data based on the sequence of coded values defined in a particular code page. A binary collation in SQL Server defines the language locale and the ANSI code page to be used, enforcing a binary sort order. Binary collations are useful in achieving improved application performance due to their relative simplicity.

Previous binary collations in SQL Server performed an incomplete code-point-to-code-point comparison for Unicode data, in that older SQL Server binary collations compared the first character as WCHAR, followed by a byte-by-byte comparison. For backward compatibility reasons, existing binary collation semantics will not be changed.

Guidelines for Using Binary Collations

If your Microsoft SQL Server 2005 applications interact with older versions of SQL Server that use binary collations, continue to use binary. Binary collations may be a more suitable choice for mixed environments.

Guidelines for Using BIN2 Collations

Binary collations in this release of SQL Server include a new set of pure code-point comparison collations. Customers can choose to migrate to the new binary collations to take advantage of true code-point comparisons, and they should utilize the new binary collations for development of new applications. The new BIN2 suffix identifies collation names that implement the new code-point collation semantics. In addition, a new comparison flag is added corresponding to BIN2 for the new binary sort. Advantages include simpler application development and clearer semantics.

IE. BIN2 collation is equivalent to using Ordinal in C# wrt to charindex.

Cine
  • 4,255
  • 26
  • 46
  • Thanks for the suggestion. – Gazza Oct 20 '16 at 21:30
  • Unfortunately we have to use the collation Latin1_General_CI_AS in this database. I am keen to know why the Latin1_General_CI_AS collation sequence does not behave correctly with CHARINDEX and CHAR(254) – Gazza Oct 20 '16 at 21:32
  • 1
    You can override collation in all queries, doesnt matter what the data is stored in. E.g. we override it for all order by to get per-language sorting – Cine Oct 21 '16 at 03:25
  • @Gazza I've seen plenty of problems with this sort of operations, and the solution is usually go for the _BIN (or _BIN2) collation. As Cine already said, no need to change at the database or table level, you can specify this on the statement level. – TT. Oct 22 '16 at 10:55
2

This isn't anything specific to SQL Server.

In C#

string.Compare("þ", "th", false, new System.Globalization.CultureInfo(1033))

returns 0 indicating the strings compare equal.

Or in notepad clicking "Replace all" below

enter image description here

leads to

enter image description here

In SQL Server collations that do not start with "SQL" use windows collation rules.

For those in most locales (Iceland being an exception) the thorn character þ expands to th.

There is more information about this specific case in this post by Michael S. Kaplan Every rose has it's Þ..... That blog has a wealth of information about windows collations. Expansions are described in more detail here.

If you don't want these semantics you need to use a collation (likely SQL or binary) with the semantics that you do want (perhaps via an explicit collate clause).

Martin Smith
  • 438,706
  • 87
  • 741
  • 845