From reading the evolving specs over the years, I had assumed that RFC 3986 had finally settled on UTF-8 as the encoding for percent-escaped octet sequences. That is, if my URI contains %XX%YY%ZZ, I can take that sequence of decoded octets (for any URI, in the scheme-specific part) and interpret the resulting bytes as UTF-8 to recover the decoded information that was intended. In practical terms, I can call JavaScript's decodeURIComponent(), which does this decoding for me automatically.
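As a minimal sketch of that assumption, decodeURIComponent() treats the escaped octets as a UTF-8 byte sequence:

```javascript
// "é" is U+00E9, which UTF-8 encodes as the two octets 0xC3 0xA9.
// decodeURIComponent() reassembles those octets as UTF-8:
const decoded = decodeURIComponent("%C3%A9");
console.log(decoded); // "é"
```

So two percent-escapes can yield a single character, because the decoder is byte-oriented first and UTF-8-aware second.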
Then I read the spec for data: URIs, RFC 2397, which includes a charset parameter that (naturally) indicates the charset of the encoded data. But how does that work? If I have a two-octet encoded sequence %XX%YY in my data: URI, does charset=iso-8859-1 indicate that the two decoded octets should not be interpreted as a UTF-8 sequence, but as two separate Latin characters (since each byte in ISO-8859-1 represents one character)? RFC 2397 seems to indicate this, as it gives an example of "greek [sic] characters":
data:text/plain;charset=iso-8859-7,%be%fg%be
But this means that JavaScript's decodeURIComponent() (which assumes UTF-8-encoded octets) can't be used to extract a string from such a data URI, correct? Does this mean I have to write my own decoding for data URIs whose charset is something other than UTF-8?
Furthermore, does this mean that RFC 2397 is now in conflict with RFC 3986, which seems to assume UTF-8? Or does RFC 3986 refer only to "new URI scheme[s]", meaning that the data: URI scheme is grandfathered in and has its own technique for specifying what the encoded octets mean?
My best guess at the moment is that data: plays by its own rules, and that if it indicates a charset other than UTF-8, I'll have to use something other than decodeURIComponent() in JavaScript. Any recommendations for a replacement method would be welcome, too.
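One possible replacement, sketched here under the assumption that TextDecoder (available in modern browsers and recent Node.js builds with full ICU) supports the charset label in question: unescape the percent-encoding into raw bytes yourself, then hand those bytes to a TextDecoder for the declared charset. decodeDataPayload is a hypothetical helper name, not a standard API:

```javascript
// Decode the payload of a data: URI according to an explicit charset,
// instead of decodeURIComponent()'s hard-wired UTF-8 assumption.
function decodeDataPayload(escaped, charset) {
  // First turn %XX escapes (and literal characters, which URI syntax
  // restricts to ASCII) into raw bytes, with no charset assumption.
  const bytes = [];
  for (let i = 0; i < escaped.length; i++) {
    if (escaped[i] === "%") {
      bytes.push(parseInt(escaped.slice(i + 1, i + 3), 16));
      i += 2;
    } else {
      bytes.push(escaped.charCodeAt(i));
    }
  }
  // Then interpret those bytes per the data: URI's charset parameter.
  return new TextDecoder(charset).decode(new Uint8Array(bytes));
}

// Latin-1: a single 0xE9 octet is "é".
console.log(decodeDataPayload("%e9", "iso-8859-1")); // "é"

// ISO-8859-7 (valid-hex Greek, unlike the RFC's %fg example):
console.log(decodeDataPayload("%EB%EF%E3%EF%F2", "iso-8859-7")); // "λογος"
```

The design point is the separation of concerns: percent-decoding is a byte-level operation, and the charset parameter only governs the second step, bytes to characters.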