Weird Normalization on .net

Question

I am trying to normalize a string (using .NET Standard 2.0) using Form D, and it works perfectly when running on a Windows machine.

    [TestMethod]
    public void TestChars()
    {            
        var original = "é";
        var normalized = original.Normalize(NormalizationForm.FormD);

        var originalBytesCsv = string.Join(',', Encoding.Unicode.GetBytes(original));
        Assert.AreEqual("233,0", originalBytesCsv);

        var normalizedBytesCsv = string.Join(',', Encoding.Unicode.GetBytes(normalized));
        Assert.AreEqual("101,0,1,3", normalizedBytesCsv);
    }

When I run this on Linux, it returns "253,255" for both strings, before and after normalization. These two bytes form the 16-bit value 65533 (U+FFFD), the Unicode replacement character, which is used when something goes wrong with encoding. That's the part where I am lost.

What am I missing here? Is there someone to point me in the right direction?

Maybe it's related to the encoding of the source file. What happens with `var original = "\u00e9"`? — nwellnhof, Aug 21 '18 at 10:26
That was exactly the problem! My file was using Windows-1252 encoding. Please, answer below, so I can mark the correct one. — Jonathas Costa, Aug 21 '18 at 12:50

score 2 · Accepted Answer · answered Aug 21 '18 at 17:09

It might be related to the encoding of the source file. I'm not sure which encodings .NET on Linux supports, but to be on the safe side, you should use plain ASCII source files and Unicode escapes for non-ASCII characters:

    var original = "\u00e9";
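
As a minimal sketch of the question's test rewritten with the escape (the class name `EscapeDemo` and the helper `BytesCsv` are made up for illustration and are not part of the original code), something like this should no longer depend on how the source file was saved:

```csharp
using System;
using System.Text;

public static class EscapeDemo
{
    // Hypothetical helper: the UTF-16LE bytes of s as a comma-separated list.
    public static string BytesCsv(string s) =>
        string.Join(",", Encoding.Unicode.GetBytes(s));

    public static void Main()
    {
        // The escape keeps the source file pure ASCII, so the literal is
        // U+00E9 regardless of the encoding the compiler assumes for the file.
        var original = "\u00e9";
        var normalized = original.Normalize(NormalizationForm.FormD);

        Console.WriteLine(BytesCsv(original));   // 233,0
        Console.WriteLine(BytesCsv(normalized)); // 101,0,1,3 (e + U+0301)
    }
}
```

The expected values are the ones from the question's own assertions; Form D splits U+00E9 into `e` (U+0065) followed by the combining acute accent (U+0301).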

nwellnhof

score 1 · Answer 2 · answered Aug 22 '18 at 02:36

There is no text but encoded text.

When communicating text to a person or program, both the bytes and the character encoding are essential.

The C# compiler (like all programs that process text, except in special cases like JSON) must know which character encoding the input files use. You must inform it accurately. The default is UTF-8 and that is a fine choice, especially for C# files, which are, lexically, sequences of Unicode code points.

If you used your editor or IDE or file transfer without full mindfulness of these requirements, you might have used an unintended character encoding.

For example, "é" saved as Windows-1252 is the single byte 0xE9. Read as UTF-8, 0xE9 is a lead byte that must be followed by two continuation bytes; since they are missing, the decoder produces � (U+FFFD) to signal the mishandling to the reader.
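
The mismatch can be reproduced at runtime. This is only a sketch: the class and method names are hypothetical, and ISO-8859-1 stands in for Windows-1252 because it is available without registering an extra encoding provider and maps 0xE9 to "é" in the same way.

```csharp
using System;
using System.Text;

public static class MisdecodeDemo
{
    // Decodes raw bytes as UTF-8; by default, malformed input becomes U+FFFD.
    public static string DecodeAsUtf8(byte[] bytes) =>
        Encoding.UTF8.GetString(bytes);

    // Decodes raw bytes as ISO-8859-1 (assumed stand-in for Windows-1252).
    public static string DecodeAsLatin1(byte[] bytes) =>
        Encoding.GetEncoding("iso-8859-1").GetString(bytes);

    public static void Main()
    {
        byte[] saved = { 0xE9 }; // "é" as a single-byte encoding would store it

        string wrong = DecodeAsUtf8(saved);   // lead byte without continuations
        string right = DecodeAsLatin1(saved); // the encoding the file really used

        Console.WriteLine((int)wrong[0]); // 65533
        Console.WriteLine((int)right[0]); // 233

        // Same bytes the question saw on Linux for both strings.
        Console.WriteLine(string.Join(",", Encoding.Unicode.GetBytes(wrong))); // 253,255
    }
}
```

Once the literal has been decoded to U+FFFD, normalization has nothing left to decompose, which would explain why the question saw identical bytes before and after `Normalize`.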

To be on the safe side, use UTF-8 everywhere but do it mindfully.
