I have a string that contains a few Thai characters. The string is Unicode, but I don't see the Thai characters in the IDE, or even when I write the string to a text file. To see the Thai characters properly I have to write the following code:

 var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";
 var ascii = Encoding.Default.GetBytes(text);           
 text = Encoding.UTF8.GetString(ascii);

After applying the above logic, I can see the string correctly, with the Thai characters. Here is the output:

// notice the thai character เดี่ยว in the string M_M-150 150CC. เดี่ยว (2 For 18 Save 2)

I am not sure why I need to apply the above logic to see the Thai characters even though the string is Unicode. What exactly is Encoding.Default doing in this case?

parag
  • `var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";` Is this your actual code or are you getting the data from somewhere? If it's not your code, please [edit] to show the real problem code or data source. – Tom Blodget Jun 21 '18 at 00:00
  • Just as @TomBlodget has indicated, where do you get that `text` from? I have an ASP.NET project, I have this (`var text2 = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";`) in Visual Studio 2017 and the Thai characters are displayed properly in **IDE**! – Just a HK developer Jun 27 '18 at 09:40
  • @user3454439 I don't see the Thai characters (เดี่ยว) either in the IDE or in the file. I need to apply the following code to see them correctly: `var ascii = Encoding.Default.GetBytes(text); text = Encoding.UTF8.GetString(ascii);` – parag Jun 27 '18 at 10:06
  • Then your question seems not related to programming. It seems your computer does not know how to display Thai character!? What OS are you using? What is the regional setting in your computer? Are you using Visual Studio? What version? Mine is Windows 10, with regional setting set to English (US), I can still see the Thai character in Visual Studio 2017. As for the output to file, use Notepad++, it can help to view file with non-English characters. – Just a HK developer Jun 28 '18 at 02:21
  • The issue is not with the machine. After applying the logic in the code snippet above, I can see the Thai characters, so it is definitely not a machine-specific issue. Other coworkers are having the same issue. – parag Jun 28 '18 at 06:10

1 Answer


From MSDN, here is what the documentation for the Encoding.Default property says:

Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, you should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.

The string is converted to bytes with Encoding.Default, but those bytes are then decoded with UTF8. So the step that matters is not Encoding.Default; it is Encoding.UTF8, which takes the bytes and converts them to a string correctly.
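One plausible reading of why that two-line conversion helps (an assumption, since the question does not show where the string really comes from) is that the text was UTF-8 bytes that got decoded with a single-byte code page somewhere upstream. The sketch below simulates that and then undoes it. Latin-1 (code page 28591) stands in for the machine's default code page, and the `MojibakeDemo`, `Garble` and `Repair` names are hypothetical.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Assumption: Latin-1 stands in for a machine's ANSI default code page.
    // The real Encoding.Default differs between machines and runtimes.
    static readonly Encoding Legacy = Encoding.GetEncoding(28591);

    // Simulates the suspected upstream mistake:
    // UTF-8 bytes decoded as if they were single-byte text.
    public static string Garble(string original) =>
        Legacy.GetString(Encoding.UTF8.GetBytes(original));

    // The repair from the question: recover the original bytes
    // with the legacy code page, then decode them as UTF-8.
    public static string Repair(string garbled) =>
        Encoding.UTF8.GetString(Legacy.GetBytes(garbled));

    static void Main()
    {
        string original = "เดี่ยว";
        string garbled = Garble(original);

        // The garbled form has one char per UTF-8 byte, so it is longer.
        Console.WriteLine(garbled.Length > original.Length);
        // The repair restores the Thai text.
        Console.WriteLine(Repair(garbled) == original);
    }
}
```

If this reading is right, the more durable fix would be to decode the data as UTF-8 at the point where it is first read, rather than repairing the string afterwards.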

The same applies if you print it to the console. Take a look at both cases: [screenshot] The second line was printed with the console configured for UTF8. You can configure your console to support UTF8 by adding this line:

Console.OutputEncoding = Encoding.UTF8;

Even with your code, the result in a file will look like this: [screenshot]

whereas if you convert the string to bytes with Encoding.UTF8:

var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";
var ascii = Encoding.UTF8.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);

the result will be:

[screenshot]

If you take a look at Supported Scripts, you'll see that UTF8 supports all Unicode characters, including Thai.

Note that Encoding.Default will not be able to handle Chinese or Japanese, for example, when the default code page does not cover those scripts. Take this example:

var text = "漢字";
var ascii = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);

Here is the output from a text file:

[screenshot]

Here, if you try to write it to a text file, it will not be converted successfully.
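The loss described above can be checked without a screenshot. This is a minimal sketch, again assuming Latin-1 (code page 28591) as a stand-in for a default code page that has no CJK coverage. `LossyDemo` and `RoundTrip` are hypothetical names.

```csharp
using System;
using System.Text;

static class LossyDemo
{
    // Mirrors the snippet above: encode with a single-byte code page,
    // then decode the resulting bytes as UTF-8.
    public static string RoundTrip(string text, Encoding legacy) =>
        Encoding.UTF8.GetString(legacy.GetBytes(text));

    static void Main()
    {
        // Assumption: Latin-1 stands in for a non-CJK default code page.
        Encoding legacy = Encoding.GetEncoding(28591);
        string text = "漢字";

        // The code page cannot represent these characters, so GetBytes
        // substitutes a fallback byte and the original data is gone.
        byte[] bytes = legacy.GetBytes(text);
        Console.WriteLine(BitConverter.ToString(bytes));

        // Decoding afterwards cannot bring the characters back.
        Console.WriteLine(RoundTrip(text, legacy) == text);
    }
}
```

The point is that the damage happens in `GetBytes`, before UTF8 is involved at all, so no later decoding step can recover the text.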

So you have to read and write it using UTF8:

 var text = "漢字";
 var ascii = Encoding.UTF8.GetBytes(text);
 text = Encoding.UTF8.GetString(ascii);

and you'll get this:

[screenshot]

So, as I said, the whole process depends on UTF8, not on the default encoding.
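For the file case, reading and writing with an explicit UTF8 encoding might look like the sketch below. It uses a temporary file, and the `Utf8FileDemo` and `WriteThenRead` names are hypothetical. It illustrates the advice above and is not code from the question.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8FileDemo
{
    // Writes the text as UTF-8 and reads it back as UTF-8,
    // so the default encoding never takes part.
    public static string WriteThenRead(string text)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, Encoding.UTF8);
            return File.ReadAllText(path, Encoding.UTF8);
        }
        finally
        {
            // Clean up the temporary file in every case.
            File.Delete(path);
        }
    }

    static void Main()
    {
        string text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2) 漢字";
        Console.WriteLine(WriteThenRead(text) == text);
    }
}
```

Naming the encoding on both the write and the read keeps the result the same across machines with different default code pages.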

Kaj
  • So the question is: `var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";` is already a Unicode string, so why can't I see the Thai characters in it? Why do I need to apply the logic to see the Thai characters even though it is a Unicode string? – parag Jun 21 '18 at 07:10
  • @parag all strings in .NET are Unicode already (UTF16). The *console* can't display all of them if it uses the wrong codepage, or the font doesn't support a specific language (extremely rare these days, only the newest emojis are missing). Kaj already answered how to change the codepage for the console – Panagiotis Kanavos Jun 21 '18 at 07:18
  • @Panagiotis Kanavos, it is not a problem with the console. Whether I view it in the IDE or write it to a file, I can't see the Thai characters even though it is a Unicode string (as you already mentioned). My question is why I need to apply the above logic even though it is a Unicode string. A Unicode string should display Thai characters properly. – parag Jun 21 '18 at 07:35