
I read some text from an HTML website and need to store the data. Using http://alexpad.com/textdecoder, I found that the source encoding is iso-8859-1 and the destination encoding is windows-874.

The source text is "áÁèÃÔÁ" and I need to convert it to "แม่ริม". However, the output is always "??????", i.e. the byte array [63,63,63,63,63,63].

string text = "áÁèÃÔÁ";
Encoding fromEncoding = Encoding.GetEncoding("iso-8859-1");
Encoding toEncoding = Encoding.GetEncoding("windows-874");
byte[] fromBytes = fromEncoding.GetBytes(text);
byte[] toBytes = Encoding.Convert(fromEncoding, toEncoding, fromBytes);
string result = toEncoding.GetString(toBytes);

The expected result is "แม่ริม", but the actual result is "??????".

  • Check this answer out: https://stackoverflow.com/a/7236718/8711654 – Faizan Aug 02 '19 at 07:07
  • You shouldn't convert anything. The bytes are already how you want them. You just need to interpret them. – Nyerguds Aug 02 '19 at 09:00
  • All text in HTML is Unicode, regardless of the document encoding. So, once you get the appropriate text using an HTML parser, it'll already be in a String, (and of course it won't be áÁèÃÔÁ). – Tom Blodget Aug 03 '19 at 07:04

1 Answer


Printed as byte values, the characters of the two strings differ by a constant 160. So is one lower case and the other upper case?

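            // Needs: using System; using System.Linq;
            // Cast each char to a byte and print the values of the ISO string.
            string iso = "áÁèÃÔÁ";
            string[] isoBytes = iso.Select(x => ((byte)x).ToString()).ToArray();
            Console.WriteLine("Iso " + string.Join(",", isoBytes));

            // Same for the Thai string; the cast keeps only the low 8 bits of each char.
            string win = "แม่ริม";
            string[] winBytes = win.Select(x => ((byte)x).ToString()).ToArray();
            Console.WriteLine("Windows " + string.Join(",", winBytes));

            Console.ReadLine();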
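
The constant 160 (0xA0) is most likely not a case difference. Windows-874 maps bytes 0xA1..0xDA and 0xDF..0xFB to U+0E01..U+0E3A and U+0E3F..U+0E5B, so casting a Thai char to byte (which keeps only the low 8 bits) always lands 0xA0 below the windows-874 byte. That fits the comment that nothing needs converting: the bytes are already right and only need to be decoded with the other encoding. A minimal sketch of that idea, assuming the string really is windows-874 text that was decoded as iso-8859-1 (the class name `Reinterpret` and the helper `Latin1ToThai` are made up for illustration):

```csharp
using System;
using System.Text;

static class Reinterpret
{
    // Hypothetical helper: get the original bytes back via iso-8859-1,
    // then decode those same bytes as windows-874. No Encoding.Convert involved.
    public static string Latin1ToThai(string misdecoded)
    {
        // Required on .NET Core / .NET 5+ so that windows-874 can be resolved;
        // assumption: on .NET Framework the code page is available without this call.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Encoding thai = Encoding.GetEncoding("windows-874");

        byte[] raw = latin1.GetBytes(misdecoded); // E1 C1 E8 C3 D4 C1
        return thai.GetString(raw);               // same bytes, read as Thai
    }

    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;   // so the console can show Thai
        Console.WriteLine(Latin1ToThai("áÁèÃÔÁ")); // แม่ริม
    }
}
```

The original code returns "??????" because Encoding.Convert tries to express the Latin characters themselves in windows-874, where they do not exist, so each one is replaced by '?' (63). If you control the download step, decoding the response bytes as windows-874 in the first place avoids the round trip entirely.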
jdweng