
I'm authoring HTML5 documents and was a little surprised that, when neither an HTTP header nor a meta element sets the text encoding, the browsers I tested (Safari, Chrome, Firefox - recent versions as of Feb 2023, macOS) fall back to windows-1252.

In particular, I'm using <!DOCTYPE html> but forgot to add the <meta charset="utf-8"> element. If I open the file locally, browsers perform auto-detection and use utf-8 when non-ASCII characters are present - but not if the files are served through a web server.

I understand that browsers can't simply default to utf-8 for all HTML files because of old content, and that auto-detection for content served over HTTP is hard (reasoning described here https://hsivonen.fi/utf-8-detection/).

What I don't understand, however, is why a modern HTML5 document in standards mode (with doctype set) does not also use utf-8 by default?

Edit: The similar question "Why it's necessary to specify the character encoding in an HTML5 document if the default character encoding for HTML5 is UTF-8?" asks why one needs to set the encoding if one (wrongly) assumes utf-8 is the default, not what the default is (or how it's selected).

blazee
  • It would make no sense to define a default charset. UTF-8 is not even useful as default as asians require UTF-16 (which is heavyweight for the rest). Also, older databases and websites are still encoded as ISO-8859-1. Declaring a default charset could cause potential incompatibility issues. As such it is easier and more redundant to simply enforce the usage of charset by W3C definitions. If any default would be used then ISO-8859-1 would be more lightweight as it is a single byte encoding compared to the multibyte encoding of UTF-8 and UTF-16 – tacoshy Feb 15 '23 at 15:21
  • possible duplicate https://stackoverflow.com/questions/52351400/why-its-necessary-to-specify-the-character-encoding-in-an-html5-document-if-the – exa.byte Feb 15 '23 at 15:23
  • @tacoshy: It makes sense to have _some_ default. It also makes sense for the default to be ascii-compatible (which rules out UTF-16). As for older websites, they had to add doctype for HTML5, so they might as well add a charset stanza. I understand why HTML4 and older don't use utf-8 but I'm asking for HTML5 with doctype set, specifically. – blazee Feb 15 '23 at 15:31
  • But as the duplicate said, you must keep performance in mind and lower the encoding cost. As such it is ISO-8859-1 as it is single-byte encoding. – tacoshy Feb 15 '23 at 15:33
  • If the general recommendation (and practice, from what I understand of current systems) is to use utf-8 or a more "expensive" encoding, that's not a strong argument. Perhaps you can clarify what performance you have in mind? – blazee Feb 15 '23 at 15:42
  • PS: The ISO-8859-1 is not the default (might be for some locales but certainly not all). Perhaps the question is then also, why complicate the selection algorithm with locale-specifics? – blazee Feb 15 '23 at 15:57
  • Does this answer your question? [Why it's necessary to specify the character encoding in an HTML5 document if the default character encoding for HTML5 is UTF-8?](https://stackoverflow.com/questions/52351400/why-its-necessary-to-specify-the-character-encoding-in-an-html5-document-if-the) – Rob Feb 15 '23 at 21:42
  • @tacoshy *"asians require UTF-16"* — Wat?! That is plainly false. – deceze Feb 16 '23 at 11:00

1 Answer


Through this question (thanks exa.byte and Rob!) and the HTML spec I believe I was able to piece together an answer.

Short answer: No, HTML5 has no default character encoding (but read on).

Long answer: Obviously browsers will use some encoding to display the page. When none is specified, the algorithm falls through to auto-detection. In my testing, browsers actually do this for local files (URLs starting with file://), and some might even do it for remote files, but the standard discourages auto-detection on remote files beyond the first 1024 bytes (which is also where the meta charset tag has to be). The 1024-byte limit is recommended so that parsing is not stalled for too long. Browsers can also skip the auto-detection step entirely if they want (I believe this is what Firefox does for remote files).

Side note: Above, "no encoding specified" means no BOM, no Content-Type header with a charset, no meta tag, no encoding inherited from a parent iframe, and no XML declaration (yes, this is used for text/html too).

So, if auto-detection didn't select an encoding, for example because several encodings were plausible or the browser didn't have enough data available at the time, the browser selects an implementation-defined default. This can be browser-dependent, but HTML5 suggests utf-8 for controlled environments and a locale-based default otherwise (step 9 of the algorithm in the HTML spec).
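To make the order of steps easier to follow, here is a rough Python sketch of the precedence as I understand it. It is not what any browser actually runs: `sniff_encoding`, `PRESCAN_LIMIT` and `locale_default` are names I made up, the meta scan is a crude regex rather than the spec's prescan, the "detection" step only checks whether the first 1024 bytes are valid UTF-8, and the iframe and XML-declaration cases are left out.

```python
import codecs
import re
from typing import Optional

PRESCAN_LIMIT = 1024  # bytes inspected for <meta charset> and for detection


def sniff_encoding(body: bytes,
                   content_type: Optional[str] = None,
                   locale_default: str = "windows-1252") -> str:
    """Simplified illustration of how an encoding gets picked (hypothetical helper)."""
    # 1. A byte order mark wins over everything else.
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if body.startswith(b"\xfe\xff"):
        return "utf-16be"
    if body.startswith(b"\xff\xfe"):
        return "utf-16le"

    # 2. charset parameter of the HTTP Content-Type header.
    if content_type:
        m = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type, re.I)
        if m:
            return m.group(1).lower()

    # 3. <meta charset=...> within the first 1024 bytes (crude regex, assumption).
    head = body[:PRESCAN_LIMIT]
    m = re.search(rb'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)', head, re.I)
    if m:
        return m.group(1).decode("ascii").lower()

    # 4. Toy auto-detection: non-ASCII bytes that form valid UTF-8.
    #    final=False tolerates a multi-byte sequence cut off at the limit.
    if any(b > 0x7F for b in head):
        try:
            codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
            return "utf-8"
        except UnicodeDecodeError:
            pass

    # 5. Implementation-defined fallback, here a locale-based default.
    return locale_default


# Non-ASCII text only after the first 1024 bytes, and nothing declared.
page = b"<!DOCTYPE html><title>t</title>" + b"x" * 2000 + "é".encode("utf-8")
print(sniff_encoding(page))
print(sniff_encoding(page, "text/html; charset=UTF-8"))
```

With these inputs the first call ends up at the locale fallback, and the second one returns the charset from the header.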

Finally, to explain the behavior I saw, where I ended up with the windows-1252 encoding: a) auto-detection failed (the non-ASCII characters were at the end of the page) and b) the browsers I use selected windows-1252 based on my preferred/selected locale.
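For anyone wondering what this looks like in practice, here is a small Python illustration of UTF-8 bytes being read as windows-1252. The sample string is my own, and `cp1252` is the name Python uses for its windows-1252 codec, which may differ slightly from the browsers' table for a few unassigned bytes.

```python
text = "café"

# What the server sends when the file is saved as UTF-8.
served_bytes = text.encode("utf-8")

# What the browser shows if it falls back to windows-1252.
misread = served_bytes.decode("cp1252")
print(misread)  # cafÃ©

# What it shows once the encoding is declared (or detected) as UTF-8.
print(served_bytes.decode("utf-8"))  # café
```

The two-byte UTF-8 sequence for "é" turns into two separate windows-1252 characters, which is the typical symptom of a missing charset declaration.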

blazee