Should I delete blank values in utf-16 encoding?

Question

When I get the bytes of a string using Encoding.Unicode, the result contains blank (0) values.

When I run this code:

byte[] value = Encoding.Unicode.GetBytes("Hi");

It gives me the output 72, 0, 105, 0.

I know this is because UTF-16 uses 2 bytes per character here and the 0 is just the second byte, but my question is: should I delete the 0s? As far as I know, they do not do anything, and my program needs to loop through the array, so the 0s would only make it slower.

I don't understand and I am confused. Why exactly do you encode text using the UTF-16/Unicode encoding if you don't want or don't expect to deal with 16-bit values? Your desire to ignore the high bytes of the 16-bit UTF-16 code units seems to contradict your act of encoding text as UTF-16 code units... — , Sep 30 '22 at 10:04
If you don't want UTF-16, then why encode as UTF-16? Simply encode as UTF-8 or maybe even plain ASCII — knittl, Sep 30 '22 at 10:06
@knittl there could be values that are not included in UTF-8/ASCII — electroroyaler, Sep 30 '22 at 10:07
@electroroyaler then those values will no longer be valid if you remove half of the value. — knittl, Sep 30 '22 at 10:08
@electroroyaler, and how would you squeeze such character values into a single byte, then? — , Sep 30 '22 at 10:08
Deleting specific values from the array would require looping through the array anyway (and allocating a new array) - so just ignoring those 0's would be faster (and I am still questioning the validity of this process) — Hans Keﬆing, Sep 30 '22 at 10:08
@MySkullCaveIsADarkPlace The string could ALSO contain values that need both bytes, but the majority of the string will only be characters that are included in the ASCII encoding — electroroyaler, Sep 30 '22 at 10:10
@electroroyaler, oh great. Now tell me how you would then identify, in your byte array, whether a byte is a single character value on its own or just one of the two bytes of another character value? Have you actually thought about how you are going to process such a byte array where the 0-bytes have been stripped? — , Sep 30 '22 at 10:11
@MySkullCaveIsADarkPlace That's what I was asking, if there was a way to add only the bytes needed — electroroyaler, Sep 30 '22 at 10:13
Why don't you use UTF-8? It is compatible with ASCII and uses only a single byte for all ASCII characters. Other characters are encoded with 2-4 bytes. UTF-8 never produces null bytes, except for the NUL character (U+0000) itself. — knittl, Sep 30 '22 at 10:15
"_add only the bytes needed_" That harks back to my last comment and the question I posed to you: Can you process the byte array successfully without the 0-bytes in it, even when it contains a mix of 1-byte and 2-byte character values? If you can answer this with "yes", then you don't need the 0-bytes in the array. If you can't find a way to process your byte array correctly without the 0-bytes in it, then the answer is "no", and you cannot afford to strip the 0-bytes from the array, obviously. (Hint: The answer is no unless you restrict yourself to 1-byte character values **only**) — , Sep 30 '22 at 10:16
Deleting zeroes would cause a [mojibake](https://en.wikipedia.org/wiki/Mojibake) case (*example in Python for its universal intelligibility*): `"Hi electroroyaler!".encode( 'utf-16-le').replace( b'\x00' ,b'').decode( 'utf-16-le')` returns `'楈攠敬瑣潲潲慹敬ⅲ'`… — JosefZ, Sep 30 '22 at 10:17

knittl · Accepted Answer · 2022-09-30T17:34:09.967

No, you must not delete bytes from a text encoding, because then you end up with garbage that can no longer be considered a valid encoding of the text.
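To make that concrete, here is a minimal sketch of what stripping the zero bytes does. The sample string is just an illustration, not taken from the question; the point is only that the decoder pairs up whatever bytes are left.

```csharp
using System;
using System.Linq;
using System.Text;

// Illustrative sample: mostly ASCII plus one non-ASCII character (ö, U+00F6).
string original = "Hi \u00F6";
byte[] utf16 = Encoding.Unicode.GetBytes(original);   // 72,0,105,0,32,0,246,0

// Strip every zero byte, as proposed in the question.
byte[] stripped = utf16.Where(b => b != 0).ToArray(); // 72,105,32,246

// Decoding as UTF-16 now combines unrelated neighbouring bytes into code units.
string decoded = Encoding.Unicode.GetString(stripped);

Console.WriteLine(string.Join(",", stripped));
Console.WriteLine(decoded == original); // False
```

Note that a zero byte is not always padding either: a character such as U+0100 encodes to the bytes 0,1 in UTF-16, so the zero carries information there.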

If you have many ASCII characters and a few non-ASCII characters, you are probably better off with the UTF-8 encoding instead of UTF-16.

UTF-8 encodes to a single byte for ASCII chars and uses 2-4 bytes for non-ASCII chars.
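As a rough sketch of the size difference for mostly-ASCII text (the sample string below is made up for illustration), you can compare the byte counts directly:

```csharp
using System;
using System.Text;

// Illustrative sample: 11 ASCII characters plus one non-ASCII character (ö).
string text = "Hello, w\u00F6rld";

int utf16Bytes = Encoding.Unicode.GetByteCount(text); // 2 bytes per char here
int utf8Bytes = Encoding.UTF8.GetByteCount(text);     // 1 byte per ASCII char, 2 for ö

Console.WriteLine($"UTF-16: {utf16Bytes} bytes"); // UTF-16: 24 bytes
Console.WriteLine($"UTF-8:  {utf8Bytes} bytes");  // UTF-8:  13 bytes

// The UTF-8 bytes still decode back to the identical string.
string roundTripped = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text));
Console.WriteLine(roundTripped == text); // True
```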

Here's an illustrative example:

using System;
using System.Text;

var text = "ö"; // U+00F6
Console.WriteLine(string.Join(",", Encoding.Unicode.GetBytes(text))); // 246,0 (UTF-16, little-endian)
Console.WriteLine(string.Join(",", Encoding.UTF8.GetBytes(text))); // 195,182
Console.WriteLine(string.Join(",", Encoding.UTF32.GetBytes(text))); // 246,0,0,0 (little-endian)

Identical text/character/letter, different encodings

edited Sep 30 '22 at 17:34

answered Sep 30 '22 at 10:18

knittl


Is there a way to check whether the text contains anything that UTF-8 cannot encode and then switch to UTF-16 accordingly? – electroroyaler Sep 30 '22 at 10:23
@electroroyaler: UTF-8 can encode *all* Unicode characters. – Jon Skeet Sep 30 '22 at 10:24
UTF-8 can encode everything that UTF-16 can encode. – knittl Sep 30 '22 at 10:24
@JonSkeet oh ah, I had assumed that since UTF-16 uses more bytes, it would cover a larger number of Unicode characters. But what is the point of UTF-16 then, if it encodes the same thing as UTF-8? – electroroyaler Sep 30 '22 at 10:27
They are different encodings for the same characters (code points). UTF-16 uses 2 or 4 bytes per character; UTF-8 uses 1-4 bytes. [Which char is not in UTF-16?](https://stackoverflow.com/a/50331272/112968) – knittl Sep 30 '22 at 10:30
@electroroyaler: UTF-16 is actually how Windows stores strings internally; Windows adopted its 16-bit predecessor, UCS-2, before UTF-8 became established. But the rule is simple: ALWAYS use UTF-8, it just works. – Poul Bak Sep 30 '22 at 11:56
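
To illustrate the comments above, here is a small sketch (the sample characters are my own choice, not from the thread) showing that both encodings cover the same code points and only differ in how many bytes each one takes. It assumes well-formed strings, i.e. no unpaired surrogates.

```csharp
using System;
using System.Text;

// Illustrative samples: ASCII, Latin-1 range, a BMP symbol, and a character
// outside the BMP (U+1F600), which UTF-16 stores as a surrogate pair.
string[] samples = { "A", "\u00F6", "\u20AC", "\U0001F600" };

foreach (string s in samples)
{
    byte[] utf8 = Encoding.UTF8.GetBytes(s);
    byte[] utf16 = Encoding.Unicode.GetBytes(s);

    // Both encodings decode back to the identical string.
    bool roundTrips = Encoding.UTF8.GetString(utf8) == s
                   && Encoding.Unicode.GetString(utf16) == s;

    int codePoint = char.ConvertToUtf32(s, 0);
    Console.WriteLine($"U+{codePoint:X4}: UTF-8 {utf8.Length} bytes, " +
                      $"UTF-16 {utf16.Length} bytes, round-trips: {roundTrips}");
}
// U+0041: UTF-8 1 bytes, UTF-16 2 bytes, round-trips: True
// U+00F6: UTF-8 2 bytes, UTF-16 2 bytes, round-trips: True
// U+20AC: UTF-8 3 bytes, UTF-16 2 bytes, round-trips: True
// U+1F600: UTF-8 4 bytes, UTF-16 4 bytes, round-trips: True
```

So there is no need to detect "characters UTF-8 cannot encode" and switch encodings; for valid text, the choice only affects size.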
