I have code that reads data from a TextBox's Text property into a byte array. It uses UTF-8 encoding, and there have not been any issues. The code reads, say, M bytes from the text box and appends them to the output as bytes. That all works fine.

When the data is written back, there are often problems if the text is not English. For instance, take the Chinese character 南 repeated a few times; for the text box, it seems to be encoded as the bytes 0xE5, 0x8D, 0x97.

When the data is written back to the text box and, say, the first write ended on 0xE5, the next batch of data, which starts with 0x8D 0x97, is somehow transformed to 0xEF 0xBF 0xBD.

I'm just using Array.Copy, nothing special. With English there is no problem. With Chinese (and Japanese as well), the first write goes OK, but the second write has some of these "corrupted" characters.

Ron
  • There is no text but encoded text. Bytes are not characters. You should only split text at [grapheme cluster boundaries](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries). When concatenating serialized text with a [BOM](http://unicode.org/faq/utf_bom.html#bom1), be sure to elide BOMs so that there is at most one BOM and it is at the beginning of the serialization, before the text. – Tom Blodget Sep 23 '18 at 19:33

2 Answers

The problem is probably not related to reading from or writing to the text box. The problem is how you convert text to bytes and back. You have not provided any code, so my code may not be exactly what you want, but to convert a string to UTF-8 bytes you can do:

byte[] bytes = System.Text.Encoding.UTF8.GetBytes(textBox1.Text);

To convert byte[] to string:

textBox1.Text = System.Text.Encoding.UTF8.GetString(bytes);

If you ignore the encoding and just use ASCII, you will lose data when converting to bytes.
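
As an illustration of why the split point matters, here is a minimal sketch that decodes the three bytes of 南 once as a whole and once as two separately decoded chunks. The class and method names are hypothetical, and it assumes the write-back path calls Encoding.UTF8.GetString on each chunk independently, which the question does not confirm.

```csharp
using System;
using System.Text;

// Hypothetical demo; the names are illustrative and not from the asker's code.
public static class SplitDecodeDemo
{
    // Decodes each chunk on its own, the way an incremental write-back might.
    public static string DecodeChunksSeparately(byte[] data, int splitAt)
    {
        byte[] first = new byte[splitAt];
        byte[] second = new byte[data.Length - splitAt];
        Array.Copy(data, 0, first, 0, first.Length);
        Array.Copy(data, splitAt, second, 0, second.Length);
        return Encoding.UTF8.GetString(first) + Encoding.UTF8.GetString(second);
    }

    public static void Main()
    {
        // "\u5357" is 南, which UTF-8 encodes as E5 8D 97.
        byte[] data = Encoding.UTF8.GetBytes("\u5357");

        string whole = Encoding.UTF8.GetString(data);
        string split = DecodeChunksSeparately(data, 1);

        // Decoding the whole buffer round-trips the character.
        Console.WriteLine(whole == "\u5357");
        // Decoding the halves separately yields U+FFFD replacement characters.
        Console.WriteLine(split.IndexOf('\uFFFD') >= 0);
    }
}
```

U+FFFD is the Unicode replacement character, and its UTF-8 encoding is 0xEF 0xBF 0xBD, which matches the bytes reported in the question.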

There is also a question related to converting Chinese to byte[]: How to encode and decode Broken Chinese/Unicode characters?

Ashkan Mobayen Khiabani

First, thanks for that information. I only used Chinese as an example. The code will not know the language and should not care; it could be Hindi or Japanese. Your byte[]-to-string conversion is what I use.

After I posted the question, I realized that the code seems to handle the data correctly; only the write back to the TextBox's Text property goes wrong. I'm not sure what the control is doing. Perhaps it "detects" the language, or detects that the data is not valid UTF-8, and tries some other kind of encoding.

In any case, I deferred writing the bytes back into the text box until the end, and that seems to work just fine. That is to say, I keep adding the bytes to an array using Array.Copy(...) and at the end write the whole thing back into the text box using UTF-8, as you mentioned.
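
If the text has to be shown before all the bytes have arrived, one possible alternative to deferring the write is a stateful System.Text.Decoder, which holds back an incomplete trailing byte sequence until the next chunk completes it. The sketch below is only an illustration: Utf8ChunkReader and its Append method are hypothetical names, not part of my actual code.

```csharp
using System;
using System.Text;

// Hypothetical helper; not part of the asker's code.
public sealed class Utf8ChunkReader
{
    // A Decoder keeps an incomplete trailing byte sequence between calls.
    private readonly Decoder _decoder = Encoding.UTF8.GetDecoder();

    // Returns only the characters that are complete so far.
    public string Append(byte[] chunk, int offset, int count)
    {
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(count)];
        int written = _decoder.GetChars(chunk, offset, count, chars, 0);
        return new string(chars, 0, written);
    }
}

public static class Utf8ChunkReaderDemo
{
    public static void Main()
    {
        // Two copies of 南 (U+5357): E5 8D 97 E5 8D 97.
        byte[] data = Encoding.UTF8.GetBytes("\u5357\u5357");

        var reader = new Utf8ChunkReader();
        var text = new StringBuilder();

        // The first chunk ends in the middle of a character (only 0xE5).
        text.Append(reader.Append(data, 0, 1));
        // The second chunk supplies the remaining bytes.
        text.Append(reader.Append(data, 1, data.Length - 1));

        Console.WriteLine(text.ToString() == "\u5357\u5357");
    }
}
```

In a form, each string returned by Append could be passed to textBox1.AppendText(...) instead of being decoded with Encoding.UTF8.GetString per chunk.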

Ron
  • This seems like a comment to @Ashkan's [answer](https://stackoverflow.com/a/52469210/2226988) rather than an answer. As an answer, it's a bit hard to follow. – Tom Blodget Sep 23 '18 at 19:35
  • This description should be part of your question. And you should post the code you're using; otherwise it's all a guessing game. Anyway, about the TextBox control behaviour, see this [about Font fallback](https://stackoverflow.com/questions/51608365/how-can-label-control-display-japanese-characters-properly-when-font-of-the-labe?answertab=active#tab-top). – Jimi Sep 23 '18 at 20:35