Why do web browsers change file IRIs?

Question

The standard for file IRIs (https://www.rfc-editor.org/rfc/rfc8089) makes a distinction between file IRIs with no authority [1] and file IRIs with empty authority [2].

Modern web browsers (tested on Firefox and Chrome) automatically change [1] to [2]. E.g., when [1] appears in a link tag, the effective link followed is [2]. (No such rewriting rule is explained in the RFC document.)

[1] file:/C:/Program%20Files/Protege_2.1/2211#created_for
[2] file:///C:/Program%20Files/Protege_2.1/2211#created_for

Does anybody know why browsers are doing this and whether this is standards-compliant?

This results in real-world issues in Linked Data settings, where [1] and [2] denote distinct resources.

Well, look at [Appendix C](https://tools.ietf.org/html/rfc8089#appendix-C). Also, it seems to me that these two protocol slashes are mandatory in the `file:` UR**L** scheme... Perhaps you could add more popular tags to your question, e.g. [tag:firefox]. – Stanislav Kralin, Jan 29 '18 at 21:29

score 0 · Answer 1 · answered Oct 13 '18 at 08:59

When you enter either example URI in the browser, or follow it as a link in a web document, the browser has to interpret the URI (which is given as a string) as a URL in order to locate a resource. It is in this translation step, from an input string denoting a valid URI to a valid URL, that one form is changed into the other.

To check whether the given string is a valid URL, the browser runs it through a state machine that builds an in-memory representation of the URL. The state machine handles the syntactic differences between the two examples, but both end up with the same in-memory representation. That is, nothing in it records whether the input had no authority or an empty authority.

Next, the in-memory representation of the URL is serialised back into a string, which is the URL string you see in the browser after entering it. When the scheme of the in-memory representation is 'file', the serialiser always writes `//` after `file:`, so the output always starts with `file://`.

This behaviour is outlined in the WHATWG URL standard [1], see URL-Parsing (file-state) [2] and URL-Serializing [3].
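The round trip can be observed directly with the standard `URL` class, which implements the WHATWG parser and serialiser. The following is a minimal sketch, assuming a runtime that exposes `URL` as a global (current browsers and Node.js do); the names `noAuthority`, `emptyAuthority` and `roundTrip` are only illustrative.

```typescript
// The two spellings from the question.
const noAuthority = "file:/C:/Program%20Files/Protege_2.1/2211#created_for";
const emptyAuthority = "file:///C:/Program%20Files/Protege_2.1/2211#created_for";

// Parse the string into a URL object, then serialise it back via `href`.
const roundTrip = (input: string): string => new URL(input).href;

console.log(roundTrip(noAuthority));
console.log(roundTrip(emptyAuthority));
// Both lines print the triple-slash form:
// file:///C:/Program%20Files/Protege_2.1/2211#created_for
```

Once a string has gone through this round trip, there is no way to tell which of the two spellings it started as.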

Whether this serialisation issue is the result of stricter requirements for URLs than for URIs is something I was also wondering about. However, Appendix A of RFC 8089 states:

'According to the definition in [RFC1738], a file URL always started with the token "file://", followed by an (optionally blank) host name and a "/". The syntax given in Section 2 makes the entire authority component, including the double slashes "//", optional.'

Since that remark explicitly speaks of URLs, I take it to mean that the syntax in Section 2 of RFC 8089 makes the authority component optional for URLs too, not just under the broader URI definition. The WHATWG URL standard seems to follow RFC 1738 in this respect. It actually considers two URLs equivalent when their parsed and re-serialised forms are equal, which is the case for your examples. Hence it seems the browser behaviour is not in line with the latest RFC, and RFC 8089 warns about this as well.
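That notion of equivalence can be sketched as a comparison of serialised forms. The helper below is hypothetical (the name `sameUrl` and its `excludeFragments` parameter are my own, not an API of any library) and again assumes a global `URL` class.

```typescript
// Hypothetical helper: two URL strings count as equal when their
// parsed and re-serialised forms are identical.
function sameUrl(a: string, b: string, excludeFragments = false): boolean {
  const left = new URL(a);
  const right = new URL(b);
  if (excludeFragments) {
    // Assigning "" to `hash` drops the fragment before serialisation.
    left.hash = "";
    right.hash = "";
  }
  return left.href === right.href;
}

const first = "file:/C:/Program%20Files/Protege_2.1/2211#created_for";
const second = "file:///C:/Program%20Files/Protege_2.1/2211#created_for";

console.log(sameUrl(first, second)); // true
console.log(first === second);       // false
```

So if your Linked Data application needs to keep [1] and [2] apart, it probably has to compare the raw strings itself and avoid passing them through a URL parser of this kind first.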

[1] https://url.spec.whatwg.org

[2] https://url.spec.whatwg.org/#file-state

[3] https://url.spec.whatwg.org/#url-serializing
