
When I read all the bytes from a string using Encoding.Unicode, it gives me zero (0) values.

When I run this code:

byte[] value = Encoding.Unicode.GetBytes("Hi");

It gives me the output

72
0
105
0

I know this is because UTF-16 stores these characters in 2 bytes each and the 0 is just the second byte, but my question is: should I delete the 0's? As far as I know, they do not do anything, and my program needs to loop through the array, so the 0's would only make it slower.

  • I don't understand, and I am confused. Why exactly do you encode text using the UTF-16/Unicode encoding if you don't want or don't expect to deal with 16-bit values? Your desire to ignore the high bytes of the 16-bit UTF-16 code units seems to contradict your act of encoding text as UTF-16 code units... –  Sep 30 '22 at 10:04
  • If you don't want UTF-16, then why encode as UTF-16? Simply encode as UTF-8 or maybe even plain ASCII – knittl Sep 30 '22 at 10:06
  • @knittl there could be values that are not included in UTF-8/ASCII – electroroyaler Sep 30 '22 at 10:07
  • @electroroyaler then those values will no longer be valid if you remove half of the value. – knittl Sep 30 '22 at 10:08
  • @electroroyaler, and how would you squeeze such character values into a single byte, then? –  Sep 30 '22 at 10:08
  • Deleting specific values from the array would require looping through the array anyway (and allocating a new array) - so just ignoring those 0's would be faster (and I am still questioning the validity of this process) – Hans Kesting Sep 30 '22 at 10:08
  • @MySkullCaveIsADarkPlace The string could ALSO contain values that need the full bytes but the majority of the string will only be characters that are included in ASCII encoding – electroroyaler Sep 30 '22 at 10:10
  • @electroroyaler, oh great. Now tell me how you would then identify, in your byte array, whether a byte is a single character value or just one of two bytes of another character value? Have you actually thought about how you are going to process such a byte array where the 0-bytes have been stripped? –  Sep 30 '22 at 10:11
  • @MySkullCaveIsADarkPlace That's what I was asking, if there was a way to add only the bytes needed – electroroyaler Sep 30 '22 at 10:13
  • Why don't you use UTF-8? It is compatible with ASCII and uses only a single byte for all ASCII characters. Other characters are encoded with 2-4 bytes. UTF-8 does not produce null bytes (except for the NUL character U+0000 itself). – knittl Sep 30 '22 at 10:15
  • "_add only the bytes needed_" That harks back to my last comment and the question I posed to you: Can you process the byte array successfully without the 0-bytes in it, even when it contains a mix of 1-byte and 2-byte character values? If you can answer this with "yes", then you don't need the 0-bytes in the array. If you don't find a way to process your byte array correctly without the 0-bytes in it, then the answer is "no", and you cannot afford to strip the 0-bytes from the array, obviously. (Hint: The answer is no unless you restrict yourself to 1-byte character values **only**) –  Sep 30 '22 at 10:16
  • Deleting zeroes would cause a [mojibake](https://en.wikipedia.org/wiki/Mojibake) case (*example in Python for its universal intelligibility*): `"Hi electroroyaler!".encode( 'utf-16-le').replace( b'\x00' ,b'').decode( 'utf-16-le')` returns `'楈攠敬瑣潲潲慹敬ⅲ'`… – JosefZ Sep 30 '22 at 10:17

1 Answer


No, you must not delete bytes from a text encoding, because then you end up with garbage that can no longer be considered a valid encoding of the text.
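To make the failure concrete, here is a minimal sketch of what happens when the zero bytes are stripped and the result is decoded again. It assumes C# 9 or later (top-level statements); the sample string and variable names are only illustrative.

```csharp
using System;
using System.Linq;
using System.Text;

// Encode as UTF-16 (little endian), then strip every zero byte,
// which is the operation proposed in the question.
byte[] utf16 = Encoding.Unicode.GetBytes("Hi there");
byte[] stripped = utf16.Where(b => b != 0).ToArray();

// Decoding the stripped bytes as UTF-16 pairs up bytes that belonged
// to different characters, so the original text is lost.
string decoded = Encoding.Unicode.GetString(stripped);

Console.WriteLine(utf16.Length);          // 16
Console.WriteLine(stripped.Length);       // 8
Console.WriteLine(decoded.Length);        // 4
Console.WriteLine(decoded == "Hi there"); // False
```

The stripped array is half the size, but nothing in it records where the removed bytes were, so the decoder has no way to recover the original characters.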

If you have many ASCII characters and a few non-ASCII characters, you are probably better off with the UTF-8 encoding instead of UTF-16.

UTF-8 encodes to a single byte for ASCII chars and uses 2-4 bytes for non-ASCII chars.

Here's an illustrative example:

var text = "ö";
Console.WriteLine(string.Join(",", Encoding.Unicode.GetBytes(text))); // 246,0
Console.WriteLine(string.Join(",", Encoding.UTF8.GetBytes(text))); // 195,182
Console.WriteLine(string.Join(",", Encoding.UTF32.GetBytes(text))); // 246,0,0,0

Identical text/character/letter, different encodings
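For mostly-ASCII text, the size difference can be checked directly with `Encoding.GetByteCount`. The following is a small sketch, assuming C# 9 or later (top-level statements); the sample string is made up for illustration.

```csharp
using System;
using System.Text;

// Ten ASCII characters plus one non-ASCII character (ö, U+00F6).
string sample = "Hello w\u00F6rld";

int utf8Bytes = Encoding.UTF8.GetByteCount(sample);     // 10 * 1 + 2
int utf16Bytes = Encoding.Unicode.GetByteCount(sample); // 11 * 2

// UTF-8 is a complete encoding, so decoding returns the original string.
string roundTripped = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(sample));

Console.WriteLine(utf8Bytes);              // 12
Console.WriteLine(utf16Bytes);             // 22
Console.WriteLine(roundTripped == sample); // True
```

This gives the compact array the question was after, without removing any bytes from a valid encoding.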

knittl
  • Is there a way to check whether the text contains anything that UTF-8 cannot encode and then switch to UTF-16 accordingly? – electroroyaler Sep 30 '22 at 10:23
  • @electroroyaler: UTF-8 can encode *all* Unicode characters. – Jon Skeet Sep 30 '22 at 10:24
  • UTF-8 can encode everything that UTF-16 can encode. – knittl Sep 30 '22 at 10:24
  • @JonSkeet oh, I had assumed that since UTF-16 uses more bytes, it would cover a larger number of Unicode characters. But what is the point of UTF-16 then, if it encodes the same thing as UTF-8? – electroroyaler Sep 30 '22 at 10:27
  • They are different encodings for the same characters (code points). UTF-16 uses 2 or 4 bytes per character, UTF-8 uses 1-4 bytes per character. [Which char is not in UTF-16?](https://stackoverflow.com/a/50331272/112968) – knittl Sep 30 '22 at 10:30
  • @electroroyaler: UTF-16 is actually how Windows stores strings internally; Windows adopted 16-bit Unicode before UTF-8 was in common use. But the rule is simple: ALWAYS use UTF-8, it just works. – Poul Bak Sep 30 '22 at 11:56
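The byte counts mentioned in these comments can be checked with a short sketch. It assumes C# 9 or later (top-level statements), and the sample characters are chosen only for illustration: an ASCII letter, ö (U+00F6), the euro sign (U+20AC), and an emoji outside the Basic Multilingual Plane (U+1F600).

```csharp
using System;
using System.Text;

string[] samples = { "A", "\u00F6", "\u20AC", "\U0001F600" };

foreach (string s in samples)
{
    byte[] utf8 = Encoding.UTF8.GetBytes(s);
    byte[] utf16 = Encoding.Unicode.GetBytes(s);

    // Both encodings cover every code point, so both round-trip.
    bool utf8Ok = Encoding.UTF8.GetString(utf8) == s;
    bool utf16Ok = Encoding.Unicode.GetString(utf16) == s;

    Console.WriteLine($"{utf8.Length} {utf16.Length} {utf8Ok} {utf16Ok}");
}
// 1 2 True True
// 2 2 True True
// 3 2 True True
// 4 4 True True
```

So there is no need to detect "characters UTF-8 cannot encode" and fall back to UTF-16: the two encodings differ only in how many bytes each character takes.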