Should I change from UTF-8 to UTF-16 to accommodate Chinese characters in my HTML?

Question

I am using ASP.NET MVC, MS SQL and IIS. A few users have entered Chinese characters in their profile info. However, when I display this information it shows up as æŽå¼·è¯, even though the values are correct in my database. Currently the character encoding for my HTML pages is set to UTF-8. Should I change it to UTF-16? I understand that switching can cause a few problems, but what are my choices?

are you using `htmlentities()` or `htmlspecialchars()` when outputting? — Andrew67, Oct 05 '10 at 14:53
Have you tried specifying your character set in your meta tags? https://www.w3.org/International/questions/qa-html-encoding-declarations — Jonas Stawski, Aug 21 '17 at 20:18

score 28 · Answer 1 · edited Feb 20 '18 at 11:23

UTF-8 and UTF-16 encode exactly the same set of characters. It's not that UTF-8 doesn't cover Chinese characters and UTF-16 does. UTF-16 uses one 16-bit code unit for most characters and two (a surrogate pair) for characters outside the Basic Multilingual Plane, while UTF-8 uses 1, 2, 3, or at most 4 bytes depending on the character, so that an ASCII character is still represented as 1 byte. Start with this Wikipedia article to get the idea behind it.
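As a rough illustration of the point above, here is a minimal Python sketch (Python is used only for brevity; the question itself is about ASP.NET). The sample characters and the helper name `encoded_sizes` are my own choices for the example, not anything from the question.

```python
# Compare how UTF-8 and UTF-16 encode the same code points.
# The sample characters below are arbitrary picks for illustration.
samples = {
    "ASCII letter": "A",
    "CJK ideograph": "\u5f37",        # a common Chinese character
    "supplementary": "\U00020000",    # outside the Basic Multilingual Plane
}


def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # "-le" avoids the 2-byte BOM prefix
    # Both encodings decode back to the identical character.
    assert utf8.decode("utf-8") == ch
    assert utf16.decode("utf-16-le") == ch
    return len(utf8), len(utf16)


for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

Both encodings round-trip every sample; only the byte counts differ, which is why the choice between them does not affect which characters can be displayed.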

So there's little chance that switching to UTF-16 will help you at all, and there's a chance it makes things worse, as discussed in the SO question you linked above. The problem is somewhere else in your setup: some component does not correctly handle characters outside ASCII or Latin-1. Make sure every part of your setup works in UTF-8.
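To see what that kind of mismatch looks like, here is a hedged Python sketch. It assumes the stored name is the three characters U+674E U+5F37 U+83EF and that some layer reads the UTF-8 bytes as Windows-1252, silently dropping bytes that code page does not define. Both are guesses that happen to reproduce the garbled text in the question; the actual faulty component in the asker's stack is not known.

```python
# Assumed original text (a guess consistent with the garbled output).
original = "\u674e\u5f37\u83ef"

# What a correctly configured layer sends: the UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# What a misconfigured layer might do: interpret those bytes as
# Windows-1252. errors="ignore" drops the bytes (0x9D, 0x8F) that
# Python's cp1252 codec leaves undefined.
garbled = utf8_bytes.decode("cp1252", errors="ignore")
print(garbled)

# Decoding with the right encoding recovers the original text.
recovered = utf8_bytes.decode("utf-8")
print(recovered == original)
```

The bytes are fine in both cases; only the interpretation is wrong, which is consistent with the data looking correct in the database.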

answered Oct 05 '10 at 14:59 by Yuji · edited Feb 20 '18 at 11:23 by dannymac

UTF-16 can use two 16-bit code units, taking 32 bits in total to represent a character; see some examples at http://en.wikipedia.org/wiki/UTF-16 – Anton Roslov Aug 22 '13 at 11:31

@yuji Actually UTF-8 can use up to 4 bytes. Initially it allowed up to 6, but that was recognized as overkill (only around 110,000 characters are in use today, while 6 bytes would allow for 2 billion), so people settled on 4 bytes: http://tools.ietf.org/html/rfc3629 – joakim Nov 27 '14 at 01:27

score 6 · Answer 2 · answered Oct 05 '10 at 14:56

Every UTF encoding can represent the same Unicode characters, so switching to UTF-16 wouldn't help. There's an encoding issue somewhere, and with UTF-16 you would only end up with a different wrong HTML representation. Of course, if you have some library that simply encodes non-ASCII characters as entities and does support wide characters, your problem may be solved by the switch. There are, however, characters that need two wide characters (a surrogate pair), and these would still be shown wrong, although users might rarely notice. The best option would be to have whatever is producing the HTML interpret your UTF-8 correctly.
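For the point about characters that need two wide characters, a small Python sketch may help. The helper name `utf16_code_units` is hypothetical and the sample code points are arbitrary picks for illustration.

```python
def utf16_code_units(ch):
    """Return the 16-bit code units UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")  # big-endian, no BOM
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


# A character inside the Basic Multilingual Plane: one code unit.
print([hex(u) for u in utf16_code_units("\u5f37")])

# A character outside it: a high and a low surrogate.
print([hex(u) for u in utf16_code_units("\U00020000")])
```

Code that treats every 16-bit unit as a whole character will split the second case in half, which is the failure mode described above.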
