-2

From the W3C:

If an HTML document does not start with a BOM, and its encoding is not explicitly given by Content-Type metadata, and the document is not an iframe srcdoc document, then the character encoding used must be an ASCII-compatible character encoding

So how can I add a BOM that indicates the document is encoded in UTF-16, for example?

user2284570
  • Do any browsers even reliably support those encodings? – Pekka Nov 17 '13 at 15:57
  • How are you encoding the document in those encodings in the first place? You should configure whatever tool you are using to set the encoding to include the BOM at the same time. – Quentin Nov 17 '13 at 15:59
  • @Pekka Unicode has been a standard for many years now. Do you know any well-known program that doesn't support UTF-32? :) – user2284570 Nov 17 '13 at 16:01
  • @user do you see a UTF-32 encoding in Chrome's or Firefox's encoding menus? – Pekka Nov 17 '13 at 16:05
  • @Quentin: like many people, I use a text/advanced text editor for editing HTML files. Usually you choose the character encoding when you save the file. It reopens with the right encoding in the editor, but you get very strange results if you open the same file in a web browser (tried WebKit/Blink/Presto/Trident) :) . Since UTF-8 is ASCII-compatible, you just need to specify the right element. But according to the W3C, you need to add a BOM at the beginning of the HTML document to make character encodings like UTF-16 or UTF-32 work properly. – user2284570 Nov 17 '13 at 16:07
  • @Pekka: For Firefox and Chrome, I don't know... but I have seen it in Midori, Thunderbird and Opera. – user2284570 Nov 17 '13 at 16:17
  • It's not there in FF and Chrome. UTF-8 is generally the way to go on the Web at the moment. – Pekka Nov 17 '13 at 16:17
  • @Pekka웃: Yes, with the exception of some charsets... Thunderbird uses Gecko, Midori uses WebKit and Opera uses Blink... I see no reason why it shouldn't be in Chrome and FF. – user2284570 Nov 17 '13 at 16:20
  • UTF-8 can handle all character sets. That said, you haven't told us your operating system or editor. It will be an editor option to provide the BOM. For example, on Windows, Notepad++ can set the encoding to `UTF8 with BOM` or `UTF16` (which provides the BOM). – Mark Tolonen Nov 17 '13 at 17:43
  • @MarkTolonen: Read the comments... I don't really care about the editor for the conversion. In reality, I used the `iconv` command. – user2284570 Nov 17 '13 at 18:13
  • @Pekka웃: [Firefox supports UTF-32](https://bugzilla.mozilla.org/show_bug.cgi?id=604317 "Mozilla official bugzilla") – user2284570 Nov 17 '13 at 18:14
  • http://superuser.com/questions/381056/iconv-generating-utf-16-with-bom – Mark Tolonen Nov 17 '13 at 18:36
  • @MarkTolonen Ah, OK... I thought the byte order mark was part of the HTML language, like the element is. I didn't know it applied to files in general... I don't have the necessary rights to migrate this question, since in that case it is not programming related. – user2284570 Nov 17 '13 at 18:39
  • note: the main advantage of UTF-32 is that it is a fixed length encoding. – user2284570 Nov 17 '13 at 19:56
  • Now that it's off topic, I would prefer this question to be migrated rather than closed... – user2284570 Nov 21 '13 at 09:54

3 Answers

2

You add a BOM by inserting U+FEFF (which is what the BOM is by definition) at the very start of the data. How you do that depends on how you are generating UTF-16 or UTF-32 in the first place.
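As a rough illustration of "inserting U+FEFF at the very start of the data", here is a minimal Python sketch. The helper name `encode_utf16_with_bom` and the sample markup are made up for this example; it assumes you are producing the bytes yourself rather than relying on an editor or `iconv`.

```python
import codecs

def encode_utf16_with_bom(text: str, little_endian: bool = True) -> bytes:
    """Encode text as UTF-16 and prepend the matching byte order mark.

    Hypothetical helper for illustration. The BOM is simply U+FEFF
    encoded in the same byte order as the rest of the data.
    """
    if little_endian:
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# Sample document (illustrative only)
html = "<!DOCTYPE html><title>Test</title><p>Hello</p>"
data = encode_utf16_with_bom(html)

# The first two bytes are the little-endian BOM, FF FE
print(data[:2].hex(" "))
```

Python's plain `"utf-16"` codec also writes a BOM on encoding, but in the platform's native byte order, so spelling the byte order out as above keeps the result predictable.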

The “rephrased” question “how I can display an utf-16/utf-32 encoded html document?” is really a different question, and the short answer is: mostly, you don’t. There is hardly any reason to use UTF-16 or UTF-32 for an HTML document. The recommendations clearly favor UTF-8. But if you use UTF-16 or UTF-32, then you should primarily take care of the Content-Type header, and additionally include a BOM.

Jukka K. Korpela
  • Again, the Content-Type header is for HTTP only, not for e-mails... I recognize there are some problems, according to the official [Unicode website](http://www.unicode.org/faq/utf_bom.html#utf16-5), which make some people prefer non-standard encodings... – user2284570 Nov 17 '13 at 18:23
  • @user UTF-8 is very much a standard encoding. As far as the web is concerned, it currently is the lowest common denominator and the ideal way to go. – Pekka Nov 17 '13 at 18:33
  • @user2284570, proper e-mail format (MIME) has Content-Type headers. – Jukka K. Korpela Nov 17 '13 at 19:18
1

The hint is here:

its encoding is not explicitly given by Content-Type metadata

You should try that (by HTTP headers, by a `<meta>` element, etc.). For inserting the BOM, your code editor should be able to do that.

Please also see the W3C specs:

Most of the time you are probably better off choosing UTF-8 as your encoding. [...] One reason for this is that there are special rules for declaring the encoding of a UTF-16 page.

Whether you use element-based declarations or not, you should ensure that you always have a byte-order mark at the very start of a UTF-16 encoded file. In effect, this is the in-document declaration.

Furthermore, if your page is encoded as UTF-16, do not declare your file to be "UTF-16BE" or "UTF-16LE", use "UTF-16" only. The byte-order mark at the beginning of your file will indicate whether the encoding scheme is little-endian or big-endian. (This is because content explicitly encoded as, say, UTF-16BE should not use a byte-order mark; but HTML5 requires a byte-order mark for UTF-16 encoded pages.)

http://www.w3.org/International/questions/qa-html-encoding-declarations#utf16
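To make the "byte-order mark indicates the byte order" point concrete, here is a small Python sketch that inspects the first bytes of some data and reports which BOM, if any, is present. The function name `sniff_bom` and the returned labels are my own choices for illustration, not taken from any specification.

```python
import codecs

# UTF-32 marks are checked before UTF-16 ones, because the UTF-32 LE
# mark (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

def sniff_bom(data: bytes):
    """Return a label for the BOM at the start of data, or None if absent.

    Hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xff\xfe<\x00h\x00"))  # little-endian UTF-16 data
print(sniff_bom(b"<html>"))              # no BOM at all
```

This only looks at the leading bytes; it does not validate that the rest of the data is actually well-formed in the detected encoding.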

Scorchio
  • The HTTP header is well known for dealing with this... unfortunately it won't work in e-mails. I need a general-purpose solution. `For inserting the BOM, your code editor should be able to do that.`: I didn't find that in vi, Geany, Windows wordpad.exe or notepad.exe. I guess I have to do it manually, but I don't know how. That's the purpose of this question. – user2284570 Nov 17 '13 at 16:14
  • @user2284570: For vi(m), please see this one: http://vim.wikia.com/wiki/Working_with_Unicode, especially "'bomb' (boolean): if set, vim will put a "byte order mark" (or BOM for short) at the start of Unicode files." – Scorchio Nov 17 '13 at 17:26
  • @user2284570: Geany: http://www.geany.org/manual/#character-sets-and-unicode-byte-order-mark-bom On Windows, use Notepad++ instead; it can handle BOM well. – Scorchio Nov 17 '13 at 17:27
  • In plain old notepad.exe, use Unicode as the save option for what is really UTF16. It will have a BOM. There is a UTF8 option as well. It will also have a BOM. – Mark Tolonen Nov 17 '13 at 17:46
  • @MarkTolonen: I doubt that... currently the page is displayed correctly... but as a text file, not as HTML. This is due to the encoding... – user2284570 Nov 17 '13 at 18:27
  • Use a hex editor and look at the beginning of the file. `FF FE` is a little-endian UTF-16 BOM. – Mark Tolonen Nov 17 '13 at 18:38
0

The byte order mark is a short byte sequence (the encoded form of U+FEFF) that can be placed at the beginning of any text file.
It is not specific to HTML or other web languages.

A hex editor is a good way to add it.
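If you would rather script it than use a hex editor, something along these lines should work. This is only a sketch: the function name `add_utf16_bom` is hypothetical, and it assumes you already know the byte order of the existing UTF-16 data, since that cannot be reliably guessed from the bytes alone.

```python
import codecs

def add_utf16_bom(data: bytes, little_endian: bool = True) -> bytes:
    """Return data with a UTF-16 BOM prepended, unless one is already there.

    Hypothetical helper for illustration. The caller must know the byte
    order of the existing data.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data  # already marked, leave untouched
    bom = codecs.BOM_UTF16_LE if little_endian else codecs.BOM_UTF16_BE
    return bom + data

# BOM-less little-endian UTF-16 bytes, as some converters produce
raw = "<p>hi</p>".encode("utf-16-le")
fixed = add_utf16_bom(raw)
print(fixed[:2].hex(" "))
```

For a file on disk, you would read it in binary mode, pass the bytes through the function, and write the result back in binary mode.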

Although UTF-32 offers the advantage of being a fixed-length encoding, some browsers and e-mail clients have dropped support for it.

Note: UTF-16 is mainly used on Windows.

user2284570