Yes, this is an issue of encoding. By default, strings in SQL Server are ASCII (one byte; values 0 - 255) unless denoted with a preceding capital "N" (lower-case "n" doesn't work). If the string literal is preceded by a capital-"N", then it is handled as UNICODE (technically UCS-2; two-bytes; values 0 - 65535).
The following SQL will show the difference for your test string in terms of the default VARCHAR (i.e. no "N" preceding the string literal) and NVARCHAR (i.e. string literal preceded by an "N"):
SELECT 'ąśćźń' AS [OriginalAsVARCHAR],
N'ąśćźń' AS [OriginalAsNVARCHAR],
CONVERT(NVARCHAR(50), 'ąśćźń') AS [VARCHARConvertToNVARCHAR],
ASCII('ą') AS [AsciiValueOf_ą],
UNICODE(N'ą') AS [UnicodeValueOf_ą];
Returns:
AsVARCHAR AsNVARCHAR VARCHARConvertToNVARCHAR AsciiValueOf_ą UnicodeValueOf_ą
asczn ąśćźń asczn 97 261
As you can see in the first field, the literal that you were passing to the CLR function (i.e. MyReplace('ąśćźń')
) had already been replaced before it was sent into the function. Hence, the CLR function was not doing anything, which explains why it in fact did nothing when you passed in the real data. The third field above shows what happened as you passed the string literal into the function in the test that appeared to work; converting it to NVARCHAR didn't help as SQL Server couldn't store those characters in plain ''
. If you were to test by putting an "N" in front of the literal as it is passed in (i.e. MyReplace(N'ąśćźń')
) you would be testing what really happens when the NVARCHAR(MAX) field Something2
is passed in.
The final two fields above show the difference in the underlying value. The "AsciiValue" field is 97 because it had already transformed it into "a". The "UnicodeValue" field, however, shows the actual value.
So, most of the work of the desired replacements is done for you by SQL Server. There really is no reason at all to have a SQLCLR function here. I am all for SQLCLR, but in this case it provides no benefit and will just be slower than a pure T-SQL solution.
The only items not taken care of by the conversion to VARCHAR are the Ó
and ó
. The reason for this is that those two characters already fit into the ASCII range as their values are under 256. You can see that by running the following. Please keep in mind that ASCII values 128 - 255 map to different characters based on code pages. Since the two characters here fit within that 128 - 255 range, it will not always be the case that they even have a mapping within the ASCII range. However, in the case that they don't they should be handled automatically in the CONVERT
just as the rest of the characters were. But that also means that any of those other characters might then have a mapping in whatever code page is in use, in which case those characters and not the Ó
and ó
will need to be handled via REPLACE
.
SELECT ASCII(N'Ó') AS [AsciiValueOf Ó],
UNICODE(N'Ó') AS [UnicodeValueOf Ó];
Returns:
AsciiValueOf Ó UnicodeValueOf Ó
211 211
In order to fix those two remaining characters, we just use two simple calls to the T-SQL REPLACE
function. The following SQL demonstrates, for the entire replacement set noted in your C# function, each of the three steps and then all of them together. Please note the use of COLLATE Latin1_General_BIN
which allows for separating the capital "Ó" from the lower-case "ó". This COLLATE
is needed if your database default collation is not case-sensitive and the field in the table did not explicitly override the case-insensitive default with a case-sensitive collation.
DECLARE @Var NVARCHAR(50);
SET @Var = N'ąĄćĆęĘłŁńŃóÓśŚżŻźŹ';
SELECT @Var AS [Original],
CONVERT(VARCHAR(50), @Var) AS [AllButTheÓs],
REPLACE(CONVERT(VARCHAR(50), @Var), 'Ó' COLLATE Latin1_General_BIN, 'O')
AS [ReplacedThe_Ó],
REPLACE(CONVERT(VARCHAR(50), @Var), 'ó' COLLATE Latin1_General_BIN, 'o')
AS [ReplacedThe_ó],
REPLACE(
REPLACE(
CONVERT(VARCHAR(50), @Var),
'Ó' COLLATE Latin1_General_BIN,
'O'
),
'ó' COLLATE Latin1_General_BIN,
'o'
) AS [AllReplaced];
Returns:
Original AllButTheÓs ReplacedThe_Ó ReplacedThe_ó AllReplaced
ąĄćĆęĘłŁńŃóÓśŚżŻźŹ aAcCeElLnNóÓsSzZzZ aAcCeElLnNóOsSzZzZ aAcCeElLnNoÓsSzZzZ aAcCeElLnNoOsSzZzZ
If you want to be able to do this in a function rather than pasting that CONVERT/REPLACE/REPLACE everywhere, it would be best to do this in an Inline Table-Valued Function. Please note that no explicit CONVERT
is issued as it is being handled implicitly via the datatype of the input parameter.
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
GO
IF (OBJECT_ID('RemovePolishDiacriticSigns') IS NOT NULL)
BEGIN
DROP FUNCTION RemovePolishDiacriticSigns;
END;
GO
CREATE FUNCTION RemovePolishDiacriticSigns(@TheString VARCHAR(MAX))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
SELECT REPLACE(
REPLACE(
@TheString,
'Ó' COLLATE Latin1_General_BIN,
'O'
),
'ó' COLLATE Latin1_General_BIN,
'o'
) AS [TransformedString];
The function is used via CROSS APPLY
as shown here:
SELECT *
FROM ( SELECT N'ąĄćĆęĘłŁńŃóÓśŚżŻźŹ' ) tmp(val)
CROSS APPLY dbo.RemovePolishDiacriticSigns(tmp.val);
Returns:
val TransformedString
ąĄćĆęĘłŁńŃóÓśŚżŻźŹ aAcCeElLnNoOsSzZzZ
Unrelated to the specific question about replacing these Polish characters, here are some SQLCLR notes:
- When a function always returns the same output given the exact same input (across all input parameters), you should use
IsDeterministic=true
in the SqlFunction()
attribute as there is a considerable performance benefit. Meaning: [SqlFunction(IsDeterministic=true)]
.
- There is no need to call
.ToString()
on the SqlString
input parameter. All of the Sql*
types have a .Value
property that returns the expected value in the proper .Net type. Meaning: String nowy = toReplace.Value;
.
- For more SQLCLR information, I have been writing SQLCLR-focused articles for SQL Server Central (free registration required) in their Stairway to SQLCLR series.