I'm having some real issues with a site we're building on our bespoke content management system. The system renders all views via XSLT, which may be the problem.

The problem we're experiencing appears to be the result of character encoding mismatches, but I'm struggling to work out which part of the process is breaking down.

The issue does not occur in Firefox or Chrome. In IE, the page is fine on initial load and on refresh. However, when using the 'back' or 'forward' button in IE, any Unicode characters show as a white question mark in a black diamond (the replacement character, U+FFFD), which implies that the wrong character set is being used. We've also seen odd results with the page as indexed by Google: it appears to index the DOCTYPE reference and the content of the head element rather than the page content, as would normally be the case.

All of the XSLT stylesheets are outputting UTF-16 and the XSLT files themselves are UTF-16 files (previously there was a mismatch). The site is serving the pages as UTF-16 and the HTML output has a meta tag setting the content type to use a charset of UTF-16.

I've checked the results using Fiddler to see what's coming from the server. However, Fiddler isn't logging a request/response when IE uses the back/forward buttons, so presumably IE has the pages cached somewhere.

Anyone got any ideas?

Dimitre Novatchev
Chris Disley
  • Update to the above: When I view source in IE (IE8 in case it makes any difference) I am getting things showing up with encoding issues (my default text editor is loading it as hex) – Chris Disley Oct 17 '11 at 10:26
  • See http://www.w3.org/International/questions/qa-html-encoding-declarations#utf16 for some issues using UTF-16 on the web. Are you doing the XSLT transformation on the server within the content management system or are you letting the browser do the XSLT? Do you have a public URL we can visit? – Martin Honnen Oct 17 '11 at 16:20

2 Answers

"The site is serving the pages as UTF-16"

Whoa! Don't do that.

There are several browser bugs to do with UTF-16 pages. I hadn't heard of this particular one before but it's common for UTF-16 to break form handling, for example. UTF-16 is very rarely used on the web, and as a consequence it turns up a lot of little-known bugs in browsers and other agents (like search engines and other tools written in one of the many scripting languages with poor Unicode support like PHP).

"the HTML output has a meta tag setting the content type to use a charset of UTF-16"

This has no effect. If the browser fails to detect UTF-16 then, because UTF-16 is not ASCII-compatible, it won't even be able to read the meta tag.

On the web, always use an ASCII-compatible encoding, usually UTF-8. UTF-8 is by far the best-supported encoding, and for typical markup-heavy pages it is almost always smaller than UTF-16. UTF-16 offers pretty much no advantage and I would avoid it in every case.
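To make the "not ASCII-compatible" point concrete, here is a minimal Python sketch. The page fragment is a made-up example (only the meta tag mirrors the one described in the question), and the byte search is a simplification of how a client might pre-scan for a meta tag before it knows the encoding, not a model of any particular browser.

```python
# Hypothetical page fragment; not taken from the asker's site.
html = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=UTF-16"></head>'
    '<body>caf\u00e9</body></html>'
)

utf8_bytes = html.encode("utf-8")
utf16_bytes = html.encode("utf-16")  # Python prepends a byte order mark here

# A client that has not yet identified the encoding can only look for
# ASCII byte sequences such as b"<meta".
meta_visible_in_utf8 = b"<meta" in utf8_bytes
meta_visible_in_utf16 = b"<meta" in utf16_bytes  # every character is two bytes

print("meta tag readable as ASCII in UTF-8: ", meta_visible_in_utf8)
print("meta tag readable as ASCII in UTF-16:", meta_visible_in_utf16)
print("UTF-8 size: ", len(utf8_bytes), "bytes")
print("UTF-16 size:", len(utf16_bytes), "bytes")
```

In UTF-16 each ASCII character is padded with a zero byte, so the ASCII sequence for the meta tag never appears in the byte stream, and the mostly-ASCII markup roughly doubles in size.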

bobince
  • Yeah, well aware that UTF-8 is the preferred encoding across the web but I'm working with legacy code here. To change to UTF-8 would mean changing every XSLT file to output UTF-8, amending the XSLT Processing routine (.NET uses UTF-16 for all strings by default, so the method in use for processing at the moment will always generate UTF-16 output) and, above all else, other sites running on this CMS platform are working fine when set up to render UTF-16 files (although IIS does seem to serve them as UTF-8). – Chris Disley Oct 18 '11 at 20:26

Possibly IE is corrupting the files when they are read from the cache. This could be related to this (unfortunately unanswered) question:

Firefox & IE: Corrupted data when retrieved from cache

A few things you could check/try:

  • Make sure the encoding is specified in both the HTTP Content-Type: header and the <?xml encoding=...> declaration at the top of the XML.
  • Are you specifying the endianness of your UTF-16 or relying on the byte order mark? If the latter, try specifying it. I think Windows is usually fond of UTF-16LE.
  • Are you able to try another encoding, namely UTF-8?
  • Are you able to disable caching from the server end (if it's practical)? Pragma: no-cache or whatever its modern-day equivalent is? (Sorry, it's been a while since I played with this stuff.)
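As a rough illustration of the byte order mark and header checks above, here is a small Python sketch. The function names `sniff_utf16_bom` and `response_headers` are hypothetical helpers, not part of any CMS or server API, and whether disabling caching actually cures the IE back/forward problem is untested. `Cache-Control` is the HTTP/1.1 counterpart of the older `Pragma: no-cache`.

```python
import codecs


def sniff_utf16_bom(data: bytes):
    """Return the UTF-16 byte order implied by a leading BOM, or None.

    Simplification: a UTF-32-LE BOM starts with the same two bytes as the
    UTF-16-LE BOM and is not distinguished here.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


def response_headers(charset: str) -> dict:
    """Build headers that declare the charset and ask clients not to cache."""
    return {
        "Content-Type": f"text/html; charset={charset}",
        "Cache-Control": "no-store",  # HTTP/1.1
        "Pragma": "no-cache",         # HTTP/1.0 fallback
    }


# Example: output with an explicit little-endian BOM versus output without one.
with_bom = codecs.BOM_UTF16_LE + "<html>".encode("utf-16-le")
without_bom = "<html>".encode("utf-16-le")

print(sniff_utf16_bom(with_bom))
print(sniff_utf16_bom(without_bom))
print(response_headers("utf-8"))
```

Running the sniffing function over a saved copy of the server's response is one way to see whether the output relies on a BOM at all.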

Sorry, no real answer here, but too much to write as a comment.

Community
Sodved