Converting string from CP866 to UTF8

Question

I have a database (MSSQL) with a table of translations for product names. One of the languages is Russian.

An example of a database entry is ¸ą¤®åą Øā«ģ. Using the Universal Cyrillic decoder I managed to find out that it reads Прдохранитль, that the source encoding is CP866, and that I need to get it into Windows-1257 or UTF-8.

How to do this in C#?

I tried something like:

string line = "¸ą¤®åą Øā«ģ";

Encoding cp866 = Encoding.GetEncoding("CP866");
Encoding w1257 = Encoding.GetEncoding("windows-1257");
byte[] cp866Bytes = cp866.GetBytes(line);
byte[] w1257Bytes = Encoding.Convert(cp866, w1257, cp866Bytes);
var lineFinal = w1257.GetString(w1257Bytes);

Could anyone help me?

The result for the given code is ?a?¤Raa -Oa?<g

What's the result of what you tried? Did you get an error? — N.K, May 14 '18 at 08:44
Something is extremely off. MSSQL has supported Unicode for a few iterations now. So somebody had to do something **really** wrong before writing those values to the DB for this issue to even exist. There were a lot of early misunderstandings regarding Unicode, but this article should clear those up: http://www.joelonsoftware.com/articles/Unicode.html — Christopher, May 14 '18 at 08:47
At the point where you have a .NET string `string line = "¸ą¤®åą Øā«ģ";`, things have already gone wrong. Instead of trying to convert this wrong string into a good one, try to find the cause of that wrong string. What is the SQL data type of the column where this entry is found? `varchar`? And what is the __collation__ in SQL Server of that column? Is the string correct in the database, and in accordance with the collation? — Jeppe Stig Nielsen, May 14 '18 at 09:24
@JeppeStigNielsen Very good point. If it's in the db as bytes and those are interpreted as Win-1257, then that's plain wrong in this case, and if they would be seen as CP866 it would simply return the correct string. But if the db has it as UTF-8 then it's already corrupted and does need to actively be converted back. — Nyerguds, May 14 '18 at 09:40

score 2 · Accepted Answer · answered May 14 '18 at 09:01

Leaving aside the question of how such a string could end up in the database in the first place, you can convert it like this:

string line = "¸ą¤®åą Øā«ģ";
Encoding w1257 = Encoding.GetEncoding("windows-1257");
Encoding cp866 = Encoding.GetEncoding("CP866");            
var lineFinal = cp866.GetString(w1257.GetBytes(line));

This works because your original string appears to use the 1257 code page, and you need CP866.

Note that this specific string is still a bit damaged: it results in Предохр нитель, while the correct word is Предохранитель (so we have a space instead of а at index 8). However, the original string also contains a space at this position, so this damage is not a result of the decoding (you probably just copied it wrong into the question).

Nyerguds · Answer 2 · 2018-05-14T09:43:23.567

Your problem is that you are doing it the other way around. `line` does not show Cyrillic; the characters you are looking at are Windows-1257 characters. When you get the bytes of a string in some encoding, you are matching its symbols to that encoding, not interpreting them as that encoding, so encoding this string as CP866 will only corrupt it further.

Also realize that text in .NET has no encoding (or no encoding you need to care about, anyway). A String is just a String, a series of Unicode characters. Encoding only becomes relevant when you need it as bytes.
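To illustrate that a string only gets an encoding once it is turned into bytes, here is a minimal sketch. It assumes modern .NET, where legacy code pages have to be registered through `CodePagesEncodingProvider` first (on .NET Framework that call should not be needed). The class name `EncodingBytesDemo` and the helper `HexBytes` are made up for this example.

```csharp
using System;
using System.Text;

public static class EncodingBytesDemo
{
    // Returns the bytes of `text` in the given encoding, formatted as hex.
    public static string HexBytes(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        // Assumption: .NET Core / .NET 5+, where code pages such as 866
        // are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string text = "\u041F"; // Cyrillic capital letter П

        // The same one-character string gives different bytes per encoding.
        Console.WriteLine(HexBytes(text, Encoding.GetEncoding(866))); // 8F
        Console.WriteLine(HexBytes(text, Encoding.UTF8));             // D0-9F
    }
}
```

The string itself never changes here; only the byte representation chosen at the `GetBytes` call differs.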

We know that those characters, when encoded as Windows-1257, produce exactly the byte values that read correctly as CP866. At this moment, though, they are a pure Unicode String and not Windows-1257 bytes, so you need to first convert the string to Windows-1257 bytes and then interpret those bytes as CP866.

String line = "¸ą¤®åą Øā«ģ";
Encoding cp866 = Encoding.GetEncoding("CP866");
Encoding w1257 = Encoding.GetEncoding("windows-1257");
Byte[] w1257Bytes = w1257.GetBytes(line);
String lineFinal = cp866.GetString(w1257Bytes);
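The same two steps can be wrapped in a small helper. This is a sketch under assumptions, not a general fix: it assumes the text really is CP866 bytes that were decoded as Windows-1257, it uses code page numbers (866 and 1257) instead of names, and it registers `CodePagesEncodingProvider`, which modern .NET needs for legacy code pages. The names `MojibakeRepair` and `Cp866FromWindows1257` are hypothetical. The sample input is written with Unicode escapes and covers only the first six characters of the string from the question.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    static MojibakeRepair()
    {
        // Assumption: .NET Core / .NET 5+; on .NET Framework the legacy
        // code pages should already be available without this call.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Re-reads text that was wrongly decoded as Windows-1257 as CP866.
    public static string Cp866FromWindows1257(string garbled)
    {
        Encoding w1257 = Encoding.GetEncoding(1257);
        Encoding cp866 = Encoding.GetEncoding(866);

        // Step 1: recover the original bytes by encoding as Windows-1257.
        byte[] originalBytes = w1257.GetBytes(garbled);

        // Step 2: interpret those bytes as CP866.
        return cp866.GetString(originalBytes);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        string garbled = "\u00B8\u0105\u00A4\u00AE\u00E5\u0105"; // ¸ą¤®åą
        Console.WriteLine(Cp866FromWindows1257(garbled)); // Прдохр
    }
}
```

The result is an ordinary .NET string, so getting UTF-8 out of it needs no further repair step; `Encoding.UTF8.GetBytes(result)` is enough when bytes are required. Characters that were already lost before this point, as the missing е in the question's example appears to be, cannot be restored this way.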
