
An example HTML document retrieved over HTTP lacks:

  • an HTTP Content-Type header
  • an HTML <meta charset="<character encoding>" />
  • an HTML <meta http-equiv='Content-Type' content='text/html; charset=<character encoding>'>

With regard to HTML5, is a default such as UTF-8 assumed as the character encoding? Or is it entirely up to the application reading the HTML document to choose a default?

Jon Cram

1 Answer


The character encoding is determined by applying these rules, in order of precedence:

  1. User override.
  2. An HTTP "charset" parameter in a "Content-Type" field.
  3. A Byte Order Mark before any other data in the HTML document itself.
  4. A META declaration with a "charset" attribute.
  5. A META declaration with an "http-equiv" attribute set to "Content-Type" and a value set for "charset".
  6. Unspecified heuristic analysis.
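For anyone validating documents they don't control, the precedence list above can be sketched roughly as follows. This is a minimal illustration, not the specification's actual algorithm: the function name `sniff_encoding`, its parameters, and the returned source labels are all hypothetical, the single regular expression is a simplification that covers both META forms, and the heuristic step is deliberately left out, so the caller has to pick its own fallback.

```python
import codecs
import re
from typing import Optional, Tuple

# Byte Order Marks to look for at the very start of the document.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]

# Simplified pattern: matches <meta charset="..."> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([a-zA-Z0-9_\-:.]+)', re.IGNORECASE
)

# Charset parameter inside an HTTP Content-Type header value.
HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)


def sniff_encoding(
    body: bytes,
    content_type: Optional[str] = None,
    user_override: Optional[str] = None,
) -> Tuple[Optional[str], str]:
    """Return (encoding label, where it came from)."""
    # 1. User override.
    if user_override:
        return user_override.lower(), "user"
    # 2. HTTP "charset" parameter in the Content-Type field.
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower(), "http"
    # 3. Byte Order Mark before any other data.
    for bom, name in BOMS:
        if body.startswith(bom):
            return name, "bom"
    # 4 and 5. META declaration, searched only near the start of the document
    # (the 1024-byte window here is an assumption of this sketch).
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    # 6. Heuristic analysis is not implemented; the caller chooses a default.
    return None, "undetermined"
```

Calling `sniff_encoding(body, response_content_type)` and checking for a `None` result is one way to tell "no encoding declared anywhere" apart from "declared, but possibly wrong".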

...and then...

  1. Normalize the given character encoding string according to the Charset Alias Matching rules defined in Unicode Technical Standard #22.
  2. Override some problematic encodings, i.e. intentionally treat some encodings as if they were different encodings. The most common override is treating US-ASCII and ISO-8859-1 as Windows-1252, but there are several other encoding overrides listed in this table. As the specification notes, "The requirement to treat certain encodings as other encodings according to the table above is a willful violation of the W3C Character Model specification."
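The two follow-up steps above can be sketched like this. The helper names `normalize_label` and `effective_encoding` are hypothetical, the normalization follows my reading of the alias matching rules (drop everything except letters and digits, lowercase, drop zeros that are not preceded by a digit), and the override table only contains the two entries mentioned in this answer, not the full table from the specification.

```python
import re

# Only the overrides named in the text above; the real table is longer.
OVERRIDES = {
    "usascii": "windows-1252",
    "iso88591": "windows-1252",
}


def normalize_label(label: str) -> str:
    """Loose alias matching: keep letters and digits, lowercase, drop leading zeros."""
    stripped = re.sub(r"[^A-Za-z0-9]", "", label).lower()
    # Delete zeros that are not preceded by a digit ("utf08" becomes "utf8").
    return re.sub(r"(?<![0-9])0+", "", stripped)


def effective_encoding(label: str) -> str:
    """Return the encoding a browser-like consumer would actually decode with."""
    return OVERRIDES.get(normalize_label(label), label.strip().lower())


# Bytes 0x93 and 0x94 are curly quotes in Windows-1252 but control
# characters in ISO-8859-1, which is why the override matters in practice.
text = b"\x93quoted\x94".decode(effective_encoding("ISO-8859-1"))
```

Here `text` ends up containing typographic quotation marks around the word, which is what the page author almost certainly intended.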

But the most important thing is:

You should always specify a character encoding on every HTML document, or bad things will happen. You can do it the hard way (HTTP Content-Type header), the easy way (<meta http-equiv> declaration), or the new way (<meta charset> attribute), but please do it. The web thanks you.
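If you generate the pages yourself, declaring the encoding in both the header and the markup takes very little code. The following is only a sketch: `build_response` is a hypothetical helper, not part of any framework, and it assumes you are happy to serve everything as UTF-8.

```python
import html
from typing import Dict, Tuple


def build_response(body_text: str, encoding: str = "utf-8") -> Tuple[Dict[str, str], bytes]:
    """Build headers and payload that both declare the same character encoding."""
    document = (
        "<!DOCTYPE html>\n"
        "<html><head>"
        f'<meta charset="{encoding}">'  # the "new way"
        "<title>Example</title>"
        "</head><body>"
        f"<p>{html.escape(body_text)}</p>"
        "</body></html>"
    )
    payload = document.encode(encoding)
    headers = {
        # the "hard way": the HTTP header, which outranks the META tag
        "Content-Type": f"text/html; charset={encoding}",
        "Content-Length": str(len(payload)),
    }
    return headers, payload
```

Keeping the header and the META declaration generated from the same `encoding` value avoids the situation where the two disagree.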


Ilmari Karonen
ThiefMaster
  • Thanks, I appreciate that a character encoding should always be defined. I'm validating documents over which I have no control and need to be aware of whether I should revert to a default encoding if none is specified. – Jon Cram Sep 13 '12 at 12:27
  • Consider using the same logic the W3 validator uses. It's open source so you can just look at its code to see what it does. – ThiefMaster Sep 13 '12 at 12:43
  • This doesn't really answer the question of why the character set is needed, and what the default set is that is so bad. Also, is this still the answer? It has been 2 years since this was written, and browsers have been upgraded a lot since then. IE hasn't, but older versions have fallen away. – trysis Feb 07 '18 at 16:45
  • @trysis: ["The Encoding standard requires use of the UTF-8 character encoding"](https://html.spec.whatwg.org/multipage/semantics.html#charset). See [Require UTF-8](https://github.com/whatwg/html/commit/fae77e3c558b9f083dfb9086752863a4789268f5) – jfs Feb 26 '18 at 08:01
  • So you are saying things like `meta charset` are now required, meaning if you don't provide at least one browsers can do whatever they want & will probably use a horrible legacy value and not UTF-8? – trysis Feb 26 '18 at 18:59
  • https://stackoverflow.com/questions/14669352/is-the-charset-meta-tag-required-with-html5 seems to suggest otherwise. – Marcus Feb 25 '19 at 20:17