5

Suppose a browser encounters a <meta> tag that specifies the character encoding, like this:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

Does it start over from the beginning parsing the page again, since some of the preceding characters in the <head> section may have been interpreted incorrectly? Or are there some other constraints that prevent prior characters from being interpreted incorrectly?

Joel Lee

3 Answers

4

As far as I know, browsers won't go back after finding a charset declaration in the <head>; they assume an ASCII-compatible charset up to that point. Unfortunately I can't find a reference to confirm this.
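To illustrate why the "ASCII-compatible" assumption matters, here is a minimal sketch. It assumes a page saved as UTF-8 and a browser that tentatively decodes it as windows-1252 (both choices are only for illustration). The markup itself is pure ASCII and survives the wrong guess; only the non-ASCII character is garbled.

```python
# Illustrative only: the sample markup and the windows-1252 guess are assumptions.
raw = "<title>café</title>".encode("utf-8")

# A tentative decode with the wrong (but ASCII-compatible) encoding.
wrong = raw.decode("windows-1252")

# The decode the author intended.
right = raw.decode("utf-8")

print(wrong)  # the tags are intact, but the accented letter is mis-rendered
print(right)
```

Anything a parser needs in order to find the declaration (tag names, attribute names, the charset label) is ASCII, so it reads the same under either decoding.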

Conforming browsers will ignore a Content-Type <meta> element if the server already provides a charset in the Content-Type HTTP header, so you can't override a "wrong" server-side charset with a <meta> element.
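The precedence can be sketched roughly as follows. This is not any browser's actual code: the function names and the windows-1252 fallback are hypothetical, and real browsers have more inputs (byte-order marks, user overrides, locale defaults).

```python
from email.message import Message


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased name, or None when absent


def effective_charset(header_value, meta_charset, fallback="windows-1252"):
    """Hypothetical precedence: HTTP header first, then <meta>, then a default."""
    header_charset = charset_from_header(header_value) if header_value else None
    return header_charset or meta_charset or fallback


# The header wins even though the <meta> element says utf-8.
print(effective_charset("text/html;charset=ISO-8859-1", "utf-8"))
# No charset in the header, so the <meta> declaration is used.
print(effective_charset("text/html", "utf-8"))
```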

The point of the <meta> charset declaration is to cover HTML documents that are not served by an HTTP server.

That means you shouldn't rely on a <meta> charset declaration in the HTML file, but configure your HTTP server to provide the correct charset. If for some reason you have to rely on a <meta> charset declaration, you should only have ASCII characters up to that point and position it as early in the <head> as possible, preferably as the first element.
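As a rough sketch of why early placement and ASCII-only content matter, the helper below looks for a declaration in the raw bytes before any decoding happens. The name `sniff_meta_charset`, the regular expression, and the 1024-byte window are my own assumptions for illustration (the window mirrors the HTML5 requirement that the declaration appear within the first 1024 bytes); real browsers use a more careful prescan.

```python
import re

# Hypothetical pattern: matches both <meta charset=...> and the
# http-equiv form, where charset= appears inside the content attribute.
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(raw, limit=1024):
    """Return the declared charset (lowercased) found in the first `limit` bytes, or None."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None


page = (
    b'<html><head>'
    b'<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'
    b'</head><body></body></html>'
)
print(sniff_meta_charset(page))
```

Because the scan works on bytes, it only finds the declaration when the document's real encoding is ASCII-compatible and the declaration sits inside the scanned window, which is the reason for putting it first in the <head>.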

RoToRa
  • 1
    Thanks. I found a reference regarding "assume an ASCII compatible charset up to that point." It's at (surprise!) W3C: http://www.w3.org/TR/html4/charset.html#h-5.2.2. Good advice regarding placement of the tag, if it's needed. – Joel Lee Mar 28 '11 at 15:50
1

The parser can start over in some circumstances. The relevant spec is here: http://dev.w3.org/html5/spec/parsing.html#change-the-encoding

Note that browsers have traditionally probably not followed this algorithm exactly; chances are they've all done slightly different things. However, the link above describes what HTML5-compliant browsers should do. The algorithm described is likely an amalgam of various browsers' previous behaviour.

Since HTML5 is still a working draft, this should be considered subject to change.

Alohci
0

It has no real effect on the node structure. Only the content of text nodes (and attribute nodes) has to be transcoded.

If your server sends the

Content-type: text/html;charset=utf-8

...header, the browser will know the right charset from the start. You can achieve this with a .htaccess file containing:

AddDefaultCharset utf-8
vbence
  • But presumably it can happen that the meta tag specifies a different character set than the one in the `Content-type` header, otherwise there doesn't seem to be any point in using the meta tag for this. And although no document nodes have to be re-parsed, couldn't you still have gotten something wrong in the preceding part of the `<head>` section (e.g. a string value in some JavaScript)? – Joel Lee Mar 28 '11 at 14:20