From reading the evolving specs over the years, I had assumed that RFC 3986 had finally settled on UTF-8 as the encoding for percent-escaped octet sequences. That is, if my URI contains %XX%YY%ZZ, I can take that sequence of decoded octets (for any URI, in the scheme-specific part) and interpret the resulting bytes as UTF-8 to recover the decoded information that was intended. In practical terms, I can call JavaScript's decodeURIComponent(), which does this decoding for me automatically.
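As a minimal sketch of that assumption, decodeURIComponent() treats the escaped octets as a UTF-8 byte sequence:

```javascript
// "é" is U+00E9, which UTF-8 encodes as the two octets 0xC3 0xA9.
// decodeURIComponent() reassembles those octets as UTF-8:
const decoded = decodeURIComponent("%C3%A9");
console.log(decoded); // "é"
```

So two percent-escapes can yield a single character, because the decoder is byte-oriented first and UTF-8-aware second.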
Then I read the spec for data: URIs, RFC 2397, which includes a charset parameter that (naturally) indicates the charset of the encoded data. But how does that work? If I have a two-octet encoded sequence %XX%YY in my data: URI, does charset=iso-8859-1 indicate that the two decoded octets should not be interpreted as a UTF-8 sequence, but as two separate Latin characters (since each byte in ISO-8859-1 represents one character)? RFC 2397 seems to indicate this, as it gives an example of "greek [sic] characters":
data:text/plain;charset=iso-8859-7,%be%fg%be
But this means that JavaScript's decodeURIComponent() (which assumes UTF-8-encoded octets) can't be used to extract a string from such a data URI, correct? Does this mean I have to write my own decoding for data URIs whose charset is something other than UTF-8?
Furthermore, does this mean that RFC 2397 is now in conflict with RFC 3986, which seems to assume UTF-8? Or does RFC 3986 refer only to "new URI scheme[s]", meaning that the data: URI scheme is grandfathered in and has its own technique for specifying what the encoded octets mean?
My best guess at the moment is that data: plays by its own rules, and that if it indicates a charset other than UTF-8, I'll have to use something other than decodeURIComponent() in JavaScript. Any recommendations for a replacement method would be welcome, too.
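One possible replacement, sketched here under the assumption that TextDecoder (available in modern browsers and recent Node.js builds with full ICU) supports the charset label in question: unescape the percent-encoding into raw bytes yourself, then hand those bytes to a TextDecoder for the declared charset. decodeDataPayload is a hypothetical helper name, not a standard API:

```javascript
// Decode the payload of a data: URI according to an explicit charset,
// instead of decodeURIComponent()'s hard-wired UTF-8 assumption.
function decodeDataPayload(escaped, charset) {
  // First turn %XX escapes (and literal characters, which URI syntax
  // restricts to ASCII) into raw bytes, with no charset assumption.
  const bytes = [];
  for (let i = 0; i < escaped.length; i++) {
    if (escaped[i] === "%") {
      bytes.push(parseInt(escaped.slice(i + 1, i + 3), 16));
      i += 2;
    } else {
      bytes.push(escaped.charCodeAt(i));
    }
  }
  // Then interpret those bytes per the data: URI's charset parameter.
  return new TextDecoder(charset).decode(new Uint8Array(bytes));
}

// Latin-1: a single 0xE9 octet is "é".
console.log(decodeDataPayload("%e9", "iso-8859-1")); // "é"

// ISO-8859-7 (valid-hex Greek, unlike the RFC's %fg example):
console.log(decodeDataPayload("%EB%EF%E3%EF%F2", "iso-8859-7")); // "λογος"
```

The design point is the separation of concerns: percent-decoding is a byte-level operation, and the charset parameter only governs the second step, bytes to characters.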