0

Just wrote simple library that replace polish diacritic signs with normal ones in given string, for example ą->a, ć->c etc.

The problem is - when I call function from sql like this:

select MyReplace('ąśćźń') 

it gives me correct string - 'asczn'.

But when I call it like this

select Something, MyReplace(Something2) from Somewhere

it doesn't change anything, but function in dll executes, I checked it.

Here is library code:

[SqlFunction()]
    public static SqlString Replace(SqlString toReplace)
    {
        String nowy = toReplace.ToString();
        nowy.Replace('ą', 'a');
        nowy.Replace('Ą', 'A');
        nowy.Replace('ć', 'c');
        nowy.Replace('Ć', 'C');
        nowy.Replace('ę', 'e');
        nowy.Replace('Ę', 'E');
        nowy.Replace('ł', 'l');
        nowy.Replace('Ł', 'L');
        nowy.Replace('ń', 'n');
        nowy.Replace('Ń', 'N');
        nowy.Replace('ó', 'o');
        nowy.Replace('Ó', 'O');
        nowy.Replace('ś', 's');
        nowy.Replace('Ś', 'S');
        nowy.Replace('ż', 'z');
        nowy.Replace('Ż', 'Z');
        nowy.Replace('ź', 'z');
        nowy.Replace('Ź', 'Z');
        return new SqlString(nowy);
    }

And how I define function in SQL Server:

create Function MyReplace(@text nvarchar(max))
returns nvarchar(max)
AS
    External name sql_bezpolskich.[sql_bezpolskich.CLRFunctions].Replace

What else I noticed is that when I convert input to varchar, it replaces all except 'ó'. My guess is encoding issue, but can't find exactly what's wrong.

Thank you!

Solomon Rutzky
  • 46,688
  • 9
  • 128
  • 171
Patryk Lipski
  • 163
  • 10
  • What is de datatype of "something2"? – Tristan Aug 05 '14 at 12:01
  • Let's assume nvarchar(max) – Patryk Lipski Aug 06 '14 at 11:04
  • Ok I'm just guessing here to be honest but would you try the following: cast your string in the query to nvarchar, and to also test the collation look up the database collation and the use the collate command on the string. If I'm on to something one of those 2 will give the same bad result as applying the function to the column value – Tristan Aug 06 '14 at 17:38
  • @Tristan: you were heading in the right direction. See my answer below for the explanation. – Solomon Rutzky Sep 16 '14 at 15:02

1 Answers1

4

Yes, this is an issue of encoding. By default, strings in SQL Server are ASCII (one byte; values 0 - 255) unless denoted with a preceding capital "N" (lower-case "n" doesn't work). If the string literal is preceded by a capital-"N", then it is handled as UNICODE (technically UCS-2; two-bytes; values 0 - 65535).

The following SQL will show the difference for your test string in terms of the default VARCHAR (i.e. no "N" preceding the string literal) and NVARCHAR (i.e. string literal preceded by an "N"):

SELECT 'ąśćźń' AS [OriginalAsVARCHAR],
       N'ąśćźń' AS [OriginalAsNVARCHAR],
       CONVERT(NVARCHAR(50), 'ąśćźń') AS [VARCHARConvertToNVARCHAR],
       ASCII('ą') AS [AsciiValueOf_ą],
       UNICODE(N'ą') AS [UnicodeValueOf_ą];

Returns:

AsVARCHAR   AsNVARCHAR  VARCHARConvertToNVARCHAR    AsciiValueOf_ą  UnicodeValueOf_ą
asczn       ąśćźń       asczn                       97              261

As you can see in the first field, the literal that you were passing to the CLR function (i.e. MyReplace('ąśćźń')) had already been replaced before it was sent into the function. Hence, the CLR function was not doing anything, which explains why it in fact did nothing when you passed in the real data. The third field above shows what happened as you passed the string literal into the function in the test that appeared to work; converting it to NVARCHAR didn't help as SQL Server couldn't store those characters in plain ''. If you were to test by putting an "N" in front of the literal as it is passed in (i.e. MyReplace(N'ąśćźń')) you would be testing what really happens when the NVARCHAR(MAX) field Something2 is passed in.

The final two fields above show the difference in the underlying value. The "AsciiValue" field is 97 because it had already transformed it into "a". The "UnicodeValue" field, however, shows the actual value.

So, most of the work of the desired replacements is done for you by SQL Server. There really is no reason at all to have a SQLCLR function here. I am all for SQLCLR, but in this case it provides no benefit and will just be slower than a pure T-SQL solution.

The only items not taken care of by the conversion to VARCHAR are the Ó and ó. The reason for this is that those two characters already fit into the ASCII range as their values are under 256. You can see that by running the following. Please keep in mind that ASCII values 128 - 255 map to different characters based on code pages. Since the two characters here fit within that 128 - 255 range, it will not always be the case that they even have a mapping within the ASCII range. However, in the case that they don't they should be handled automatically in the CONVERT just as the rest of the characters were. But that also means that any of those other characters might then have a mapping in whatever code page is in use, in which case those characters and not the Ó and ó will need to be handled via REPLACE.

SELECT ASCII(N'Ó') AS [AsciiValueOf Ó],
       UNICODE(N'Ó') AS [UnicodeValueOf Ó];

Returns:

AsciiValueOf Ó  UnicodeValueOf Ó
211             211

In order to fix those two remaining characters, we just use two simple calls to the T-SQL REPLACE function. The following SQL demonstrates, for the entire replacement set noted in your C# function, each of the three steps and then all of them together. Please note the use of COLLATE Latin1_General_BIN which allows for separating the capital "Ó" from the lower-case "ó". This COLLATE is needed if your database default collation is not case-sensitive and the field in the table did not explicitly override the case-insensitive default with a case-sensitive collation.

DECLARE @Var NVARCHAR(50);
SET @Var = N'ąĄćĆęĘłŁńŃóÓśŚżŻźŹ';

SELECT @Var AS [Original],
       CONVERT(VARCHAR(50), @Var) AS [AllButTheÓs],
       REPLACE(CONVERT(VARCHAR(50), @Var), 'Ó' COLLATE Latin1_General_BIN, 'O')
            AS [ReplacedThe_Ó],
       REPLACE(CONVERT(VARCHAR(50), @Var), 'ó' COLLATE Latin1_General_BIN, 'o')
            AS [ReplacedThe_ó],
       REPLACE(
               REPLACE(
                       CONVERT(VARCHAR(50), @Var),
                       'Ó' COLLATE Latin1_General_BIN,
                       'O'
                      ),
               'ó' COLLATE Latin1_General_BIN,
               'o'
              ) AS [AllReplaced];

Returns:

Original            AllButTheÓs         ReplacedThe_Ó       ReplacedThe_ó       AllReplaced
ąĄćĆęĘłŁńŃóÓśŚżŻźŹ  aAcCeElLnNóÓsSzZzZ  aAcCeElLnNóOsSzZzZ  aAcCeElLnNoÓsSzZzZ  aAcCeElLnNoOsSzZzZ

If you want to be able to do this in a function rather than pasting that CONVERT/REPLACE/REPLACE everywhere, it would be best to do this in an Inline Table-Valued Function. Please note that no explicit CONVERT is issued as it is being handled implicitly via the datatype of the input parameter.

SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
GO

IF (OBJECT_ID('RemovePolishDiacriticSigns') IS NOT NULL)
BEGIN
    DROP FUNCTION RemovePolishDiacriticSigns;
END;
GO

CREATE FUNCTION RemovePolishDiacriticSigns(@TheString VARCHAR(MAX))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
   SELECT REPLACE(
                  REPLACE(
                          @TheString,
                          'Ó' COLLATE Latin1_General_BIN,
                          'O'
                         ),
                  'ó' COLLATE Latin1_General_BIN,
                  'o'
                 ) AS [TransformedString];

The function is used via CROSS APPLY as shown here:

SELECT *
FROM ( SELECT N'ąĄćĆęĘłŁńŃóÓśŚżŻźŹ' ) tmp(val)
CROSS APPLY dbo.RemovePolishDiacriticSigns(tmp.val);

Returns:

val                 TransformedString
ąĄćĆęĘłŁńŃóÓśŚżŻźŹ  aAcCeElLnNoOsSzZzZ

Unrelated to the specific question about replacing these Polish characters, here are some SQLCLR notes:

  1. When a function always returns the same output given the exact same input (across all input parameters), you should use IsDeterministic=true in the SqlFunction() attribute as there is a considerable performance benefit. Meaning: [SqlFunction(IsDeterministic=true)].
  2. There is no need to call .ToString() on the SqlString input parameter. All of the Sql* types have a .Value property that returns the expected value in the proper .Net type. Meaning: String nowy = toReplace.Value;.
  3. For more SQLCLR information, I have been writing SQLCLR-focused articles for SQL Server Central (free registration required) in their Stairway to SQLCLR series.
Community
  • 1
  • 1
Solomon Rutzky
  • 46,688
  • 9
  • 128
  • 171