Decode UTF-8 bytes as Latin-1 characters

Question

I have a string that I receive from a third-party app, and I would like to display it correctly in any language using C# on my Windows Surface.

Due to incorrect encoding, a piece of my string looks like this in Farsi (Persian-Arabic):

Ù…Ø¯Ù„-Ø±Ù†Ú¯-Ù…ÙˆÛŒ-Ø¬Ø¯ÛŒØ¯-5-436x500

whereas it should look like this:

مدل-رنگ-موی-جدید-5-436x500

The tool at this link converts it correctly:

http://www.ltg.ed.ac.uk/~richard/utf-8.html

How can I do this in C#?

How do you receive a string from a third party application? Files and network messages are usually bytes, not strings. — CodesInChaos, Jan 31 '17 at 12:18

score 1 · Answer 1 · answered Jan 31 '17 at 10:21

It is very hard to tell exactly what is going on from the description in your question. We would all be much better off if you provided an example of what is happening using a single character instead of a whole string, and if you chose an example character which does not belong to some exotic character set, for example the bullet character (U+2022) or something like that.

Anyhow, what is probably happening is this:

The letter "ر" is represented in UTF-8 as the byte sequence D8 B1, but what you see is "Ø±", and that's because in Latin-1 (and in Windows-1252) the byte D8 is Ø (U+00D8) and the byte B1 is ± (U+00B1). So, the incoming text was originally in UTF-8, but in the process of importing it into a .NET Unicode string in your application it was incorrectly interpreted as being in some 8-bit character set such as ANSI or Latin-1. That's why you now have a Unicode string which appears to contain garbage.
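To make the mechanism concrete, here is a minimal sketch that reproduces the garbling on purpose. The class and method names (`MojibakeDemo`, `Garble`) are made up for illustration, and it assumes the wrong decoding was Latin-1 (ISO-8859-1); your application may have used a different 8-bit code page.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: encodes text as UTF-8, then deliberately decodes
    // those bytes as Latin-1, which is the mistake described above.
    public static string Garble(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }

    public static void Main()
    {
        // U+0631 is the letter "ر".
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u0631");
        Console.WriteLine(BitConverter.ToString(utf8Bytes)); // D8-B1

        // Each byte becomes one character: U+00D8 (Ø) and U+00B1 (±).
        string garbled = Garble("\u0631");
        Console.WriteLine(garbled == "\u00D8\u00B1"); // True
    }
}
```

Every UTF-8 byte turns into exactly one character of the garbled string, which is why the original bytes can be recovered afterwards.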

However, the process of converting 8-bit characters to Unicode is for the most part not destructive, so all of the information is still there. That's why the UTF-8 tool that you linked to can still make sense of it.

What you need to do is convert the string back to an array of ANSI (or Latin-1, whichever was used) bytes, and then reconstruct the string the right way, by decoding those bytes as UTF-8.

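I cannot easily reproduce your situation, so here is something to try. The "…" and "„" characters in your sample exist in Windows-1252 but not in Latin-1, so code page 1252 is the likeliest candidate:

byte[] bytes = System.Text.Encoding.GetEncoding( 1252 ).GetBytes( garbledUnicodeString );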

followed by

string properUnicodeString = System.Text.Encoding.UTF8.GetString( bytes );
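The two calls above can be combined into one helper. This is only a sketch under some assumptions: the wrong decoding was Windows-1252, and the code runs on .NET Core 3.0 or later, where code page 1252 has to be enabled through `CodePagesEncodingProvider` (on .NET Framework that registration is not needed, and the provider comes from a separate package). The names `MojibakeRepair` and `TryRepair` are made up for illustration. Both encodings are set to throw instead of silently substituting characters, so a string that was not garbled this way is left alone.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    static MojibakeRepair()
    {
        // Makes code page 1252 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Hypothetical helper: returns true and the repaired text when the input
    // looks like UTF-8 that was mis-decoded as Windows-1252.
    public static bool TryRepair(string garbled, out string repaired)
    {
        // Strict encodings: throw rather than emit '?' or U+FFFD.
        Encoding windows1252 = Encoding.GetEncoding(
            1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        Encoding strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

        try
        {
            byte[] bytes = windows1252.GetBytes(garbled);  // back to the original bytes
            repaired = strictUtf8.GetString(bytes);        // decode them the right way
            return true;
        }
        catch (ArgumentException)
        {
            // EncoderFallbackException and DecoderFallbackException both
            // derive from ArgumentException.
            repaired = null;
            return false;
        }
    }

    public static void Main()
    {
        // "Ù…Ø¯Ù„" written with escapes; it is the first word of the sample.
        string garbled = "\u00D9\u2026\u00D8\u00AF\u00D9\u201E";
        bool ok = TryRepair(garbled, out string repaired);
        Console.WriteLine(ok && repaired == "\u0645\u062F\u0644"); // True ("مدل")
    }
}
```

If `TryRepair` returns false, the string either contains characters that do not exist in Windows-1252 or the recovered bytes are not valid UTF-8, so a different 8-bit encoding may have been involved.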
