24

I have the defined a function dbo.ufnRemoveSpaceCharacters to remove all characters under the Unicode category Zs (Space Characters). All 17 of the characters able to be replaced/translated using REPLACE and TRANSLATE when they are the only character in the string (NCHAR(1) or NVARCHAR(1)), but when the characters are combined into a string, such as a a (where the space/dash is the Ogham Space Mark 0x1680), the character is NOT replaced. The following characters are not replaced/translated Ogham Space Mark 0x1680, Narrow No Break Space 0x202F, Medium Mathematical Space 0x205F.

DROP FUNCTION IF EXISTS dbo.ufnRemoveSpaceCharacters
GO

CREATE FUNCTION dbo.ufnRemoveSpaceCharacters(@Value NVARCHAR(MAX))  
RETURNS NVARCHAR(MAX)
AS   
BEGIN  
    DECLARE @Space NCHAR = NCHAR(0x0020);
    DECLARE @NoBreakSpace NCHAR = NCHAR(0x00A0);
    DECLARE @OghamSpaceMark NCHAR = NCHAR(0x1680);
    DECLARE @EnQuad NCHAR = NCHAR(0x2000);
    DECLARE @EmQuad NCHAR = NCHAR(0x2001);
    DECLARE @EnSpace NCHAR = NCHAR(0x2002);
    DECLARE @EmSpace NCHAR = NCHAR(0x2003);
    DECLARE @ThreePerEmSpace NCHAR = NCHAR(0x2004);
    DECLARE @FourPerEmSpace NCHAR = NCHAR(0x2005);
    DECLARE @SixPerEmSpace NCHAR = NCHAR(0x2006);
    DECLARE @FigureSpace NCHAR = NCHAR(0x2007);
    DECLARE @PunctuationSpace NCHAR = NCHAR(0x2008);
    DECLARE @ThinSpace NCHAR = NCHAR(0x2009);
    DECLARE @HairSpace NCHAR = NCHAR(0x200A);
    DECLARE @NarrowNoBreakSpace NCHAR = NCHAR(0x202F);
    DECLARE @MediumMathematicalSpace NCHAR = NCHAR(0x205F);
    DECLARE @IdeographicSpace NCHAR = NCHAR(0x3000);

    DECLARE @SpaceCharacters NVARCHAR(17) =
        @Space + @NoBreakSpace + @OghamSpaceMark + @EnQuad 
      + @EmQuad + @EnSpace + @EmSpace + @ThreePerEmSpace 
      + @FourPerEmSpace + @SixPerEmSpace + @FigureSpace 
      + @PunctuationSpace + @ThinSpace + @HairSpace 
      + @NarrowNoBreakSpace + @MediumMathematicalSpace + @IdeographicSpace;

    DECLARE @ReplacePhrase NVARCHAR(17) = N'aaaaaaaaaaaaaaaaa';

    RETURN REPLACE(TRANSLATE(@Value, @SpaceCharacters, @ReplacePhrase), @Space,  '')
END
GO

You can reproduce this by using the following

DECLARE @Space NCHAR = NCHAR(0x0020);
DECLARE @NoBreakSpace NCHAR = NCHAR(0x00A0);
DECLARE @OghamSpaceMark NCHAR = NCHAR(0x1680);
DECLARE @EnQuad NCHAR = NCHAR(0x2000);
DECLARE @EmQuad NCHAR = NCHAR(0x2001);
DECLARE @EnSpace NCHAR = NCHAR(0x2002);
DECLARE @EmSpace NCHAR = NCHAR(0x2003);
DECLARE @ThreePerEmSpace NCHAR = NCHAR(0x2004);
DECLARE @FourPerEmSpace NCHAR = NCHAR(0x2005);
DECLARE @SixPerEmSpace NCHAR = NCHAR(0x2006);
DECLARE @FigureSpace NCHAR = NCHAR(0x2007);
DECLARE @PunctuationSpace NCHAR = NCHAR(0x2008);
DECLARE @ThinSpace NCHAR = NCHAR(0x2009);
DECLARE @HairSpace NCHAR = NCHAR(0x200A);
DECLARE @NarrowNoBreakSpace NCHAR = NCHAR(0x202F);
DECLARE @MediumMathematicalSpace NCHAR = NCHAR(0x205F);
DECLARE @IdeographicSpace NCHAR = NCHAR(0x3000);

DECLARE @SpaceCharacters NVARCHAR(17) =
    @Space + @NoBreakSpace + @OghamSpaceMark + @EnQuad 
    + @EmQuad + @EnSpace + @EmSpace + @ThreePerEmSpace 
    + @FourPerEmSpace + @SixPerEmSpace + @FigureSpace 
    + @PunctuationSpace + @ThinSpace + @HairSpace 
    + @NarrowNoBreakSpace + @MediumMathematicalSpace + @IdeographicSpace;

SELECT REPLACE(@OghamSpaceMark, @OghamSpaceMark, 'a'), 
       TRANSLATE(@OghamSpaceMark, @OghamSpaceMark, 'a'),
       REPLACE(@Space + @OghamSpaceMark, @OghamSpaceMark, 'a'), 
       TRANSLATE(@Space + @OghamSpaceMark + @Space, @OghamSpaceMark, 'a'),
       dbo.ufnRemoveSpaceCharacters(@SpaceCharacters),
       REPLACE(@SpaceCharacters, @OghamSpaceMark, 'a')

with output

(No column name) (No column name) (No column name) (No column name) (No column name) (No column name)
a a a aa aaaaaaaaaaa a                

I also ran the following to provide additional info

SELECT os_language_version,
       SERVERPROPERTY('LCID') AS 'Instance-LCID',
       SERVERPROPERTY('Collation') AS 'Instance-Collation',
       SERVERPROPERTY('ComparisonStyle') AS 'Instance-ComparisonStyle',
       SERVERPROPERTY('SqlSortOrder') AS 'Instance-SqlSortOrder',
       SERVERPROPERTY('SqlSortOrderName') AS 'Instance-SqlSortOrderName',
       SERVERPROPERTY('SqlCharSet') AS 'Instance-SqlCharSet',
       SERVERPROPERTY('SqlCharSetName') AS 'Instance-SqlCharSetName',
       DATABASEPROPERTYEX(N'{database_name}', 'LCID') AS 'Database-LCID',
       DATABASEPROPERTYEX(N'{database_name}', 'Collation') AS 'Database-Collation',
       DATABASEPROPERTYEX(N'{database_name}', 'ComparisonStyle') AS 'Database-ComparisonStyle',
       DATABASEPROPERTYEX(N'{database_name}', 'SQLSortOrder') AS 'Database-SQLSortOrder'
FROM sys.dm_os_windows_info;

with output

os_language_version Instance-LCID Instance-Collation Instance-ComparisonStyle Instance-SqlSortOrder Instance-SqlSortOrderName Instance-SqlCharSet Instance-SqlCharSetName Database-LCID Database-Collation Database-ComparisonStyle Database-SQLSortOrder
1033 1033 SQL_Latin1_General_CP1_CI_AS 196609 52 nocase_iso 1 iso_1 NULL NULL NULL NULL
Collin Stevens
  • 737
  • 9
  • 19
  • 2
    You have `DECLARE @ReplacePhrase NVARCHAR(17) = 'aaaaaaaaaaaaaaaaa';` did you mean `... = N'aaa ...` ? Missing that N prefix can sometimes implicitly coerce varchar behavior. – Aaron Bertrand Feb 03 '21 at 00:13
  • @AaronBertrand it fails either way. – Collin Stevens Feb 03 '21 at 00:14
  • @CollinStevens you need a line space between the previous sentence and the table - its a right pain because as you say it looks correct and then isn't. – Dale K Feb 03 '21 at 00:58
  • @DaleK thank you, I thought it was because it was outputting with the special characters, so I replaced them with regular spaces, but it made no difference. Yeah, confusing that the editor displays it correctly, but it renders incorrectly on the question view. – Collin Stevens Feb 03 '21 at 01:00
  • That's going to do my head in, especially because replace() and translate() replace U+3000 with the replacement character for U+0020 even when Unicode() says it's U+3000. – AlwaysLearning Feb 03 '21 at 09:32
  • 13
    In the last line of your function, add `COLLATE Latin1_General_BIN2` after `@Value`. Welcome to the wonderful wacky world of how collations treat character clusters; fortunately the binary collations just do as they're told. (Unfortunately it's not permitted to mark parameters or output types with a `COLLATE` clause, as this would arguably be much easier than having to figure out exactly where it has to go in every other spot.) – Jeroen Mostert Feb 03 '21 at 20:53
  • 2
    @JeroenMostert this actually worked. I don't know why, can you point me to a resource? Also, did you want to post this as an answer so I can mark it as such? – Collin Stevens Feb 03 '21 at 22:49
  • I suspect this is related to varied interpretation around https://unicode.org/reports/tr15/ – Jodrell Aug 09 '22 at 11:28
  • When you are able to use `COLLATE` and depending on the sql version there are `_SC` (Supplementary Characters) or `_UTF8` collations. Because you already have the supplied list of values to find/replace, the `BIN2` is working. What you may get into trouble with when using binary, is `SUBSTRING` and `SPLIT` functions that need to count the additional codepoint/"surrogate pair" as a single "character". See example in an [answer i supplied](https://stackoverflow.com/a/74075009/20187370) regarding `TRANSLATE` where I had luck using the `Latin1_General_100_CI_AS_KS_WS_SC` collation. – Sean Mar 07 '23 at 14:35

1 Answers1

1

@JeroenMostert Answered this with:

In the last line of your function, add COLLATE Latin1_General_BIN2 after @Value. Welcome to the wonderful wacky world of how collations treat character clusters; fortunately the binary collations just do as they're told.

(Unfortunately it's not permitted to mark parameters or output types with a COLLATE clause, as this would arguably be much easier than having to figure out exactly where it has to go in every other spot.)

APB Reports
  • 985
  • 10