Goal:
Decode from UTF-8 to clean text

Problem:
With the code below, "masavÃ¤g" does not get decoded to "masaväg".

What part am I missing?

Thank you!

Info:
Decoding "masavÃ¤g" to "masaväg" does work on this page: https://www.browserling.com/tools/utf8-decode

    UTF8Encoding utf8 = new UTF8Encoding();
    String unicodeString = "masavÃ¤g";
    // Encode the string.
    Byte[] encodedBytes = utf8.GetBytes(unicodeString);
    // Decode bytes back to string.
    String decodedString = utf8.GetString(encodedBytes);
HelloWorld1
    you say "utf8" and "clean text" as though there's a difference. It is meaningless to talk about utf8 in terms of characters - utf8 is an *encoding*: what is more interesting is the *bytes*. What you *seem* to be asking is "I have some bytes that I've decoded incorrectly; how do I fix that?" - the answer is: don't decode them incorrectly in the first place...? – Marc Gravell Dec 14 '17 at 14:18

1 Answer

The correct utf8 for "masaväg" is hex 6d 61 73 61 76 c3 a4 67
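
One quick way to check those bytes yourself is sketched below. The `HexDump` class and `ToHex` helper are names made up for this example, and the string is written with a `\u00E4` escape so the result does not depend on how the source file is saved:

```csharp
using System;
using System.Text;

static class HexDump
{
    // Hypothetical helper: UTF-8 encode the text and format the bytes as lowercase hex.
    public static string ToHex(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return BitConverter.ToString(bytes).Replace("-", " ").ToLowerInvariant();
    }

    static void Main()
    {
        // "masav\u00E4g" is "masaväg"; U+00E4 becomes the two bytes c3 a4.
        Console.WriteLine(ToHex("masav\u00E4g")); // 6d 61 73 61 76 c3 a4 67
    }
}
```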

It looks like you've decoded this using the wrong encoding; we can figure out which that might be like so:

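    var bytes = Encoding.UTF8.GetBytes("masaväg");
    foreach (var enc in Encoding.GetEncodings())
    {
        try
        {
            // Does decoding the UTF-8 bytes with this encoding reproduce the garbled text?
            if (enc.GetEncoding().GetString(bytes) == "masavÃ¤g")
            {
                Console.WriteLine($"{enc.CodePage} {enc.DisplayName}");
            }
        }
        catch { } // ignore encodings that fail to load or decode
    }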

which outputs:

    1252 Western European (Windows)
    1254 Turkish (Windows)
    28591 Western European (ISO)
    28594 Baltic (ISO)
    28599 Turkish (ISO)
    65000 Unicode (UTF-7)

Now: I don't know which of those you used, but let's assume it was 1252.

So to reverse this mess (noting that this is unreliable, and your data may already be irretrievably corrupted if you only have it as this garbled text rather than as the underlying encoded bytes):

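    var enc = Encoding.GetEncoding(1252);
    // Re-encode the garbled text with the encoding that was wrongly used to decode it...
    var bytes = enc.GetBytes("masavÃ¤g");
    // ...then decode the recovered bytes as what they really are: UTF-8.
    var viaUtf8 = Encoding.UTF8.GetString(bytes);
    Console.WriteLine(viaUtf8);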

which outputs:

    masaväg

Note that the important thing here isn't that "masavÃ¤g" is "utf8" or that "masaväg" is "clean text"; rather, "masavÃ¤g" is what you get if you use the wrong encoding to decode bytes into text. In this case the correct encoding to use when decoding would have been utf8. It is only the binary data that "is utf8". Once it is text (a string in .NET terms), it is code-points. An "encoding" (such as utf8) defines how code-points map to bytes (that's literally what an "encoding" is).
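
To make that concrete, here is a small sketch that decodes the same eight bytes twice. It uses code page 28591 (ISO-8859-1, one of the candidates listed above) as the "wrong" decoder, purely as an illustration; the class name is made up, and lengths and comparisons are printed rather than the strings themselves so the result does not depend on the console's own encoding:

```csharp
using System;
using System.Text;

static class SameBytesTwoDecodings
{
    static void Main()
    {
        // The UTF-8 bytes for "masaväg"; "ä" (U+00E4) is the pair c3 a4.
        byte[] bytes = { 0x6d, 0x61, 0x73, 0x61, 0x76, 0xc3, 0xa4, 0x67 };

        // Right decoder: c3 a4 is read as a single code point, U+00E4.
        string right = Encoding.UTF8.GetString(bytes);

        // Wrong decoder: ISO-8859-1 turns every byte into its own character,
        // so c3 becomes U+00C3 (Ã) and a4 becomes U+00A4 (¤).
        string wrong = Encoding.GetEncoding(28591).GetString(bytes);

        Console.WriteLine(right.Length);                    // 7
        Console.WriteLine(wrong.Length);                    // 8
        Console.WriteLine(right == "masav\u00E4g");         // True
        Console.WriteLine(wrong == "masav\u00C3\u00A4g");   // True
    }
}
```

The bytes never changed; only the mapping used to turn them into code-points did.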

Note: code-page 1252 is often what Encoding.Default is, hence why 1252 is a safe assumption. You should never ever use Encoding.Default for anything, frankly. You should always know what encoding you intend to use. I suggest we should submit a PR to rename Encoding.Default to Encoding.PotLuck.
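
One possible way to "know what encoding you intend to use" is to name it explicitly and have the decoder fail loudly on bytes that don't fit, instead of quietly substituting replacement characters. The sketch below is only an illustration: `ExplicitDecoding` and `TryDecodeUtf8` are hypothetical names, not part of .NET.

```csharp
using System;
using System.Text;

static class ExplicitDecoding
{
    // UTF-8 with no BOM on output, and with exceptions on invalid input bytes.
    static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: returns false when the bytes are not valid UTF-8.
    public static bool TryDecodeUtf8(byte[] bytes, out string text)
    {
        try
        {
            text = StrictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = string.Empty;
            return false;
        }
    }

    static void Main()
    {
        // "väg" encoded as UTF-8: decodes fine.
        Console.WriteLine(TryDecodeUtf8(new byte[] { 0x76, 0xc3, 0xa4, 0x67 }, out _)); // True

        // "väg" encoded as code page 1252: a lone e4 is not valid UTF-8.
        Console.WriteLine(TryDecodeUtf8(new byte[] { 0x76, 0xe4, 0x67 }, out _));       // False
    }
}
```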

Marc Gravell