Update. I created a test project on GitHub, where you can see the tests are passing on Appveyor (Windows) and failing on Travis (both Linux and OSX).
https://github.com/nopara73/UTF8Problems/


I have a .NET Core 2 xUnit test project in which I test the UTF-8 encoding of the string "é". The tests pass on Windows but fail on Linux and OSX.


Code.

[Fact]
public void CanEncode()
{
    var character = "é";
    var encoded = Encoding.UTF8.GetBytes(character);

    var bytes = new byte[] { 195, 169 };

    Assert.Equal(bytes, encoded);
}

[Fact]
public void CanDecode()
{
    var character = "é";

    var bytes = new byte[] { 195, 169 };
    var decoded = Encoding.UTF8.GetString(bytes);

    Assert.Equal(character, decoded);
}

[Fact]
public void CanEncodeDecode()
{
    var character = "é";
    var encoded = Encoding.UTF8.GetBytes(character);

    var decoded = Encoding.UTF8.GetString(encoded);

    Assert.Equal(character, decoded);
}

Failing output. Travis, Linux:

[screenshot of the failing test output]


Questions.

  • What is the reason for this behavior?
  • How should I encode such strings to make sure I get identical results, regardless of the platform?
nopara73
  • Did you compile code separately for each platform or compile only once and copy binary? – user4003407 Jan 13 '18 at 08:47
  • The reason might be the encoding in which the source code file is saved on the device. – Martin Zikmund Jan 13 '18 at 08:52
  • @PetSerAl separately. @MartinZikmund My exact method was: I cloned this Git repo and ran the `TestStrangeLinuxBug` test: https://github.com/nopara73/TorOverTcp/blob/master/TorOverTcp.Tests/TotModelsTests.cs#L78-L92 – nopara73 Jan 13 '18 at 09:39
  • @nopara73 Resave the source code files as UTF-8 with BOM, so the compiler can properly detect the encoding on both platforms and does not depend on the default codepage. Or use the [`/codepage`](https://learn.microsoft.com/dotnet/csharp/language-reference/compiler-options/codepage-compiler-option) compiler option to explicitly specify the encoding of your source files. – user4003407 Jan 13 '18 at 09:56
  • The Linux version appears to have literally encoded an "I don't know" character, #65533: http://www.fileformat.info/info/unicode/char/0fffd/index.htm. This is very odd, and suggests a problem with the encoding...? – Marc Gravell Jan 13 '18 at 10:27
  • I updated the post with a small xUnit reproduction on GitHub. @PetSerAl's idea seems reasonable, yet it doesn't work. I tried with both `charset = utf8` and `charset = utf8-bom`, then resaved the file, yet it doesn't make a difference. If you go to the commits you can see Appveyor (Windows) passing and Travis (Linux and OSX) failing: https://github.com/nopara73/UTF8Problems/commits/master – nopara73 Jan 13 '18 at 20:38
  • Your source `UnitTest1.cs` is not in UTF-8 encoding. – user4003407 Jan 14 '18 at 08:32
  • @PetSerAl Indeed, I did not restart Visual Studio after I added the `utf-8-bom` charset to `.editorconfig`; I tested it with `utf-8-bom` and before I resaved the source. Now I did it with both charsets, `utf8` and `utf8-bom`, and only utf8 checks. I guess that's expected. https://github.com/nopara73/UTF8Problems/pulls – nopara73 Jan 15 '18 at 17:52
  • Again. It turns out `utf-8` in itself works, too; it's just that the only way I was able to get VS to save the file properly was to copy the source file content, delete the file, create a new file with the same name, and paste the old content. – nopara73 Jan 16 '18 at 04:44
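
The mechanism the comments point to can be reproduced without involving the compiler at all. The following is a minimal sketch, assuming the original source file stored "é" as the single byte 0xE9 (as Windows-1252 and Latin-1 do) instead of the UTF-8 pair 0xC3 0xA9; that assumption is not confirmed in the thread beyond the remark that `UnitTest1.cs` was not UTF-8. The class name `SourceEncodingDemo` is made up for illustration. It also shows the `\u00E9` escape, which keeps the literal pure ASCII in the source and so should not depend on how the file is saved.

```csharp
using System;
using System.Text;

public static class SourceEncodingDemo
{
    public static void Main()
    {
        // Assumption: the source file held "é" as the single byte 0xE9
        // (Windows-1252 / Latin-1), not as UTF-8 (0xC3 0xA9).
        byte[] legacyBytes = { 0xE9 };

        // A lone 0xE9 is not valid UTF-8, so the decoder substitutes
        // U+FFFD (decimal 65533), the Unicode replacement character.
        string misread = Encoding.UTF8.GetString(legacyBytes);
        Console.WriteLine((int)misread[0]); // 65533

        // Encoding the replacement character gives EF-BF-BD, not C3-A9,
        // which is why a comparison against { 195, 169 } would fail.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(misread))); // EF-BF-BD

        // The escape sequence is plain ASCII in the source file, so the
        // string is the same regardless of the file's encoding.
        string character = "\u00E9";
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(character))); // C3-A9
    }
}
```

If the tests used `"\u00E9"` in place of the literal, they would presumably pass on every platform even with a mis-saved file; resaving the file as UTF-8, as described above, fixes the literal itself.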

0 Answers