5

I have a table that contains the names of various recording artists. One of them has a dash in their name. If I run the following:

Select artist
, substring(artist,8,1) as substring_artist
, ascii(substring(artist,8,1)) as ascii_table
, ascii('-') as ascii_dash_key /*The dash key next to zero */
, len(artist) as len_artist 
From [dbo].[mytable] where artist like 'Sleater%'

Then the following is returned. This seems to indicate that a dash (ascii 45) is being stored in the artist column

enter image description here

However, if I change the where clause to:

From [dbo].[mytable] where artist like 'Sleater' + char(45) + '%'

I get no results returned. If I copy and paste the output from the artist column into a hex editor, I can see that the dash is actually stored as E2 80 90, the Unicode byte sequence for the multi-byte hyphen character.

enter image description here

So, I'd like to find and replace such occurrences with a standard ascii hyphen, but I'm am at a loss as to what criteria to use to find these E2 80 90 hyphens?

basinbasin
  • 419
  • 1
  • 5
  • 12
  • 1
    Don't forget that there is the `hyphen-minus` character (unicode 0x002D), and there is the `en-dash` character (unicode 0x2013) which look very similar. See [Unicode® character table](https://unicode-table.com/en/) for reference. – TT. Aug 09 '18 at 10:15
  • 1
    Replacing your uses of `ASCII` with [`UNICODE`](https://learn.microsoft.com/sql/t-sql/functions/unicode-transact-sql) should shed more light on things. – Jeroen Mostert Aug 09 '18 at 10:15
  • The character that is actually stored in the DB is `U+2010 HYPHEN`, not `U+002D HYPHEN-MINUS`. Your use of `ASCII()` is transliterating `U+2010` into `U+002D` per the Unicode standard – Remy Lebeau Aug 13 '18 at 00:03

2 Answers2

3

Your char is the hyphen, information on it here :

https://www.charbase.com/2010-unicode-hyphen

You can see that the UTF16 code is 2010 so in T-SQL you can build it with

SELECT NCHAR(2010) 

From there you can use any SQL command with that car, for example in a select like :

Select artist
From [dbo].[mytable] where artist like N'Sleater' + NCHAR(2010) + '%'

or as you want in a

REPLACE( artist, NCHAR(2010), '-' )

with a "real" dash

EDIT:

If the collation of your DB give you some trouble with the NCHAR(2010) you can also try to use the car N'‐' that you'll copy/paste from the charbase link I gave you so :

REPLACE( artist , N'‐' , '-'  ) 

that you can even take from the string here (made with the special car) so all made for you :

update mytable set artist=REPLACE( artist, N'‐' , '-' )
Cali
  • 311
  • 1
  • 7
2

I don't know your table definition and COLLATION but I'm almost sure that you are mixing NCHAR and CHAR types and convert unicode, multibyte characters to sinle byte representations. Take a look at this demo:

WITH Demo AS
(
  SELECT N'ABC'+NCHAR(0x2010)+N'DEF' T
)
SELECT
  T,
  CASE WHEN T LIKE 'ABC'+CHAR(45)+'%' THEN 1 ELSE 0 END [Char],
  CASE WHEN T LIKE 'ABC-%' THEN 1 ELSE 0 END [Hyphen],
  CASE WHEN T LIKE N'ABC‐%' THEN 1 ELSE 0 END [Unicode-Hyphen],--unicode hyphen us used here
  CASE WHEN T LIKE N'ABC'+NCHAR(45)+N'%' THEN 1 ELSE 0 END [NChar],
  CASE WHEN CAST(T AS varchar(MAX)) LIKE 'ABC-%' THEN 1 ELSE 0 END [ConvertedToAscii],
  ASCII(NCHAR(0x2010)) ConvertedToAscii,
  CAST(SUBSTRING(T, 4, 1) AS varbinary) VarbinaryRepresentation
FROM Demo

My results:

T       Char        Hyphen      Unicode-Hyphen NChar       ConvertedToAscii ConvertedToAscii VarbinaryRepresentation
------- ----------- ----------- -------------- ----------- ---------------- ---------------- --------------------------------------------------------------
ABC‐DEF 0           0           1              0           1                45               0x1020

UTF-8 (3 bytes) representation is the same as 2010 in unicode.

Paweł Dyl
  • 8,888
  • 1
  • 11
  • 27