
I'm running a Dart web server, with Dart on the client side as well. The web data is saved in files and in a Postgres database.

Since Dart strings are UTF-16 (because WebKit strings are UTF-16), does it make sense to go to UTF-16 whole hog? That is, instead of the default UTF-8, make the following native UTF-16:

  • files (web pages)
  • database (web data)
  • HTML encoding

It seems there would be a small hit on data transfer, but at the same time it would be more efficient in the server and browser, and there would be less chance of accidental screw-ups.
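To put rough numbers on the transfer-size point, here is a small Dart sketch (using the `utf8` codec from `dart:convert`; the sample strings are just for illustration). Dart strings are sequences of UTF-16 code units, so `s.codeUnits.length * 2` approximates a raw UTF-16 payload, while `utf8.encode(s).length` is what goes over the wire as UTF-8:

```dart
import 'dart:convert';

void main() {
  const ascii = 'hello world';            // typical English page content
  const cjk = '\u4f60\u597d\u4e16\u754c'; // four CJK characters

  for (final s in [ascii, cjk]) {
    final utf8Size = utf8.encode(s).length;   // bytes sent as UTF-8
    final utf16Size = s.codeUnits.length * 2; // bytes if sent as raw UTF-16
    print('"$s": UTF-8 = $utf8Size bytes, UTF-16 = $utf16Size bytes');
  }
}
```

For mostly-ASCII HTML the UTF-16 payload is roughly double; for CJK-heavy text it can come out smaller.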

cc young
  • Old, but related: [What could go wrong in switching HTML encoding from UTF-8 to UTF-16?](http://stackoverflow.com/q/865168). It's definitely not more efficient for the client, as traffic would double in the worst-case scenario – Pekka Nov 21 '13 at 04:48
    @Pekka웃 worst case for traffic doubling would be English, of course. if files are compressed, then not much. data from web sockets still twice as large, but data packets still under 512 bytes, so really not much difference. although I'm no guru ;) – cc young Nov 21 '13 at 06:37
  • If it ain't b0rke, don't fix it. – Denis de Bernardy Nov 21 '13 at 09:06
  • @Denis - alas, a philosophy which I can appreciate but do not believe. I like things really nice and shiny neat. hence the question: is the gleam of utf-16 real or only a sparkle in the mind's eye. – cc young Nov 21 '13 at 13:56
  • What do you mean by "dart on clients as well" - is there no web server in this equation at all? I agree with Denis, though - as long as you have no issue with UTF-8, why change it. – Pekka Nov 21 '13 at 14:00
  • @Pekka웃 client scripts are in Dart, e.g., lib2.dart vs lib2.js. the issue is having to convert all data from the server from utf-8 to utf-16 (and remembering to do it) – cc young Nov 22 '13 at 01:17
  • I doubt this will improve performance - even for Asian languages with multibyte characters. However you can write some benchmarks to test this out. I'd say for most web applications there are probably other low hanging fruit that could give larger performance gains for less effort. – Greg Lowe Nov 25 '13 at 23:20

1 Answer


PostgreSQL does not support UTF-16 as a server encoding, which limits what you are talking about doing. One of the big issues you are likely to run into elsewhere is that UTF-16 allows embedded nulls, which messes up C string manipulation, while UTF-8 is far more C-friendly. For this reason, to be honest, I would try to standardize on UTF-8 to the extent possible.
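A minimal Dart sketch of the null-byte point (UTF-16LE bytes assembled by hand, since dart:convert ships no UTF-16 codec): even plain ASCII text contains zero bytes once serialized as UTF-16, whereas its UTF-8 form does not.

```dart
import 'dart:convert';

void main() {
  const s = 'OK'; // plain ASCII text

  // UTF-8 never produces a zero byte for non-NUL characters.
  print(utf8.encode(s)); // [79, 75]

  // UTF-16LE assembled by hand: every ASCII character gains a 0x00
  // high byte, which C string routines treat as a terminator.
  final utf16le = <int>[];
  for (final unit in s.codeUnits) {
    utf16le..add(unit & 0xff)..add(unit >> 8);
  }
  print(utf16le); // [79, 0, 75, 0]
}
```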

Chris Travers
  • I'm pretty sure UTF-8 also allows NULL bytes (representing themselves) because it is completely transparent for 7-bit ASCII, including control characters. It's *possible* to encode a NULL byte in other ways using the logic of UTF-8, but these are not technically valid UTF-8, which mandates a single correct encoding for each character. See http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 – IMSoP Dec 10 '13 at 16:35
  • My understanding of UTF-8 is that null bytes can be used the same way they are in C strings, i.e. as string terminators (and that is a way in which it is backwards-compatible). – Chris Travers Dec 11 '13 at 01:31
  • Ah, I see what you mean, yes, UTF-8 won't introduce any null bytes that weren't already there. OTOH, if you're reviewing all your string manipulations to use Unicode correctly, it might not be too much of a stretch to make them binary-safe as well. – IMSoP Dec 12 '13 at 10:47
  • this is true, but it is also the reason why PostgreSQL probably won't support UTF-16 any time soon. – Chris Travers Dec 12 '13 at 10:50