The correct UTF-8 for "masaväg" is hex 6d 61 73 61 76 c3 a4 67.
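If you want to check those bytes yourself, something like the following should do it. This is just a minimal sketch; the variable names are mine, and the `\u00E4` escape is used for "ä" so the result does not depend on how the source file itself was saved.

```csharp
using System;
using System.Text;

// "masaväg", with ä written as an escape so the source file's own encoding cannot interfere
string text = "masav\u00E4g";

// Encode the text as UTF-8 and format each byte as two lowercase hex digits
byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
string hex = BitConverter.ToString(utf8Bytes).Replace("-", " ").ToLowerInvariant();

Console.WriteLine(hex); // 6d 61 73 61 76 c3 a4 67
```

Note that the single character "ä" becomes the two bytes c3 a4; everything else in this word is plain ASCII and takes one byte each.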
It looks like you've decoded this using the wrong encoding; we can figure out which that might be like so:
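var bytes = Encoding.UTF8.GetBytes("masaväg");
foreach (var enc in Encoding.GetEncodings())
{
    try
    {
        // does decoding the UTF-8 bytes with this encoding produce the garbled text?
        if (enc.GetEncoding().GetString(bytes) == "masavÃ¤g")
        {
            Console.WriteLine($"{enc.CodePage} {enc.DisplayName}");
        }
    }
    catch { } // skip encodings that cannot be loaded or cannot decode these bytes
}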
which outputs:
1252 Western European (Windows)
1254 Turkish (Windows)
28591 Western European (ISO)
28594 Baltic (ISO)
28599 Turkish (ISO)
65000 Unicode (UTF-7)
Now: I don't know which of those you used, but let's assume it was 1252.
So to reverse this mess (noting that this is unreliable and your data may already be corrupted irretrievably if you only have it as this garbled text data rather than as the underlying encoded bytes):
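// on .NET Core / .NET 5+, call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first
var enc = Encoding.GetEncoding(1252);
var bytes = enc.GetBytes("masavÃ¤g");            // garbled text back to the original bytes
var viaUtf8 = Encoding.UTF8.GetString(bytes);    // decode those bytes correctly this time
Console.WriteLine(viaUtf8);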
which outputs:
masaväg
Note the important thing here isn't that "masavÃ¤g" is "utf8" or that "masaväg" is "clean text"; rather: "masavÃ¤g" is what you get if you use the wrong encoding to decode bytes into text. In this case the correct encoding to use when decoding would have been utf8. It is only the binary data that "is utf8". Once it is text (a string in .NET terms), it is code-points. And an "encoding" (such as utf8) defines how code-points map to bytes (that's literally what an "encoding" is).
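To make the code-points versus bytes distinction concrete, here is a small sketch. It uses ISO-8859-1 (code page 28591) as the "wrong" encoding instead of 1252; that is my substitution, chosen because it decodes these particular bytes the same way and is available without registering a code-page provider. The variable names are illustrative only.

```csharp
using System;
using System.Text;

string text = "\u00E4"; // "ä": one code point, U+00E4
Encoding latin1 = Encoding.GetEncoding(28591); // ISO-8859-1

// Same code point, three encodings, three different byte sequences
string asUtf8 = BitConverter.ToString(Encoding.UTF8.GetBytes(text));
string asLatin1 = BitConverter.ToString(latin1.GetBytes(text));
string asUtf16 = BitConverter.ToString(Encoding.Unicode.GetBytes(text));
Console.WriteLine(asUtf8);   // C3-A4
Console.WriteLine(asLatin1); // E4
Console.WriteLine(asUtf16);  // E4-00

// Decoding the UTF-8 bytes with the wrong encoding turns one code point into two
string wrong = latin1.GetString(Encoding.UTF8.GetBytes(text));
Console.WriteLine(wrong.Length);               // 2
Console.WriteLine(wrong == "\u00C3\u00A4");    // True (U+00C3 U+00A4, i.e. "Ã¤")
```

The string never "was" any of those encodings; only the byte arrays were.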
Note: code-page 1252 is often what Encoding.Default is, which is why 1252 is a safe assumption. You should never ever use Encoding.Default for anything, frankly. You should always know what encoding you intend to use. I suggest we should submit a PR to rename Encoding.Default to Encoding.PotLuck.
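As one possible way of "knowing your encoding" in practice, you can construct a strict UTF-8 decoder that throws on invalid input instead of silently substituting characters. This is a sketch rather than a recommendation for every scenario; the byte arrays are hand-written sample data and the variable names are mine.

```csharp
using System;
using System.Text;

// Strict UTF-8: no BOM when encoding, throw on invalid bytes when decoding
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

byte[] utf8Data = { 0x6d, 0x61, 0x73, 0x61, 0x76, 0xc3, 0xa4, 0x67 }; // "masaväg" as UTF-8
byte[] latin1Data = { 0x6d, 0x61, 0x73, 0x61, 0x76, 0xe4, 0x67 };     // "masaväg" as ISO-8859-1

// Valid UTF-8 decodes as expected
bool decodedOk = strictUtf8.GetString(utf8Data) == "masav\u00E4g";
Console.WriteLine(decodedOk); // True

// Bytes in some other encoding are rejected rather than quietly garbled
bool rejected = false;
try
{
    strictUtf8.GetString(latin1Data);
}
catch (DecoderFallbackException)
{
    rejected = true;
}
Console.WriteLine(rejected); // True
```

Failing loudly at the point of decoding is usually much easier to diagnose than discovering garbled text later.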