C#, .NET Core, Encoding.UTF8.GetBytes is not consistent

Question

Update. I created a test project on GitHub, where you can see the tests are passing on Appveyor (Windows) and failing on Travis (both Linux and OSX).
https://github.com/nopara73/UTF8Problems/

I have a .NET Core 2 xUnit test project in which I am testing the UTF-8 encoding of the string "é". On Windows the tests pass; on Linux and OSX they fail.

Code.

using System.Text;
using Xunit;

[Fact]
public void CanEncode()
{
    var character = "é";
    var encoded = Encoding.UTF8.GetBytes(character);

    // 195, 169 (0xC3, 0xA9) is the UTF-8 encoding of U+00E9 (é).
    var bytes = new byte[] { 195, 169 };

    Assert.Equal(bytes, encoded);
}

[Fact]
public void CanDecode()
{
    var character = "é";

    var bytes = new byte[] { 195, 169 };
    var decoded = Encoding.UTF8.GetString(bytes);

    Assert.Equal(character, decoded);
}

[Fact]
public void CanEncodeDecode()
{
    var character = "é";
    var encoded = Encoding.UTF8.GetBytes(character);

    var decoded = Encoding.UTF8.GetString(encoded);

    Assert.Equal(character, decoded);
}

Failing output. Travis, Linux:

Questions.

What is the reason for this behavior?
How should I encode such strings to make sure I get identical results, regardless of the platform?

Did you compile code separately for each platform or compile only once and copy binary? — user4003407, Jan 13 '18 at 08:47
The reason might be the encoding in which the source code file is saved on the device. — Martin Zikmund, Jan 13 '18 at 08:52
@PetSerAl separately. @MartinZikmund My exact method was: I cloned this Git repo and ran the `TestStrangeLinuxBug` test: https://github.com/nopara73/TorOverTcp/blob/master/TorOverTcp.Tests/TotModelsTests.cs#L78-L92 — nopara73, Jan 13 '18 at 09:39
@nopara73 Resave the source code files as UTF-8 with BOM, so the compiler can properly detect the encoding on both platforms and does not depend on the default codepage. Or use the [`/codepage`](https://learn.microsoft.com/dotnet/csharp/language-reference/compiler-options/codepage-compiler-option) compiler option to explicitly specify the encoding of your source files. — user4003407, Jan 13 '18 at 09:56
The linux version appears to have literally encoded a "I don't know" character, #65533: http://www.fileformat.info/info/unicode/char/0fffd/index.htm - this is very odd! and suggests a problem with the encoding...? — Marc Gravell, Jan 13 '18 at 10:27
I updated the post with a small xUnit reproduction on GitHub. @PetSerAl's idea seems reasonable, yet it doesn't work. I tried with both `charset = utf8` and `charset = utf8-bom`, then resaved it, yet it doesn't make a difference. If you go to the commits you can see Appveyor (Windows) passing and Travis (Linux and OSX) failing: https://github.com/nopara73/UTF8Problems/commits/master — nopara73, Jan 13 '18 at 20:38
@PetSerAl Indeed, I did not restart Visual Studio after I added the `utf-8-bom` charset to `.editorconfig`, and I tested it with `utf-8-bom` before I resaved the source. Now I did it with both charsets, `utf8` and `utf8-bom`, and only utf8 checks. I guess that's expected. https://github.com/nopara73/UTF8Problems/pulls — nopara73, Jan 15 '18 at 17:52
Again. It turns out `utf-8` in itself works, too. It's just that the only way I was able to get VS to save the file properly was to copy the source file content, delete the file, create a new file with the same name and paste the old content back in. — nopara73, Jan 16 '18 at 04:44
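
For anyone landing here with the same symptom: the comments above point at the encoding of the source file rather than at `Encoding.UTF8` itself. One way to take the source file out of the picture, as a minimal sketch and not necessarily what the repository ended up doing, is to write the character as a Unicode escape so the compiler never has to guess how the file was saved. The class and method names below (`Utf8EscapeCheck`, `RoundTrips`) are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of the original test project.
public static class Utf8EscapeCheck
{
    // "\u00E9" is é written as a Unicode escape. The source file then
    // contains only ASCII, so its on-disk encoding no longer matters.
    public const string Character = "\u00E9";

    public static bool RoundTrips()
    {
        // 195, 169 (0xC3, 0xA9) is the UTF-8 encoding of U+00E9.
        byte[] expected = { 195, 169 };

        byte[] encoded = Encoding.UTF8.GetBytes(Character);
        string decoded = Encoding.UTF8.GetString(encoded);

        // Both directions must agree with the expected values.
        return encoded.SequenceEqual(expected) && decoded == Character;
    }
}
```

Inside an xUnit `[Fact]` this could be checked with `Assert.True(Utf8EscapeCheck.RoundTrips())`. If the escape version passes on every platform while the literal "é" version fails on some, that is consistent with the source-file-encoding explanation from the comments.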
