I am implementing a URL parser and have a question about the W3C URL spec (http://www.w3.org/TR/2014/WD-url-1-20141209/). In the section "2. Percent-encoded bytes" it gives the following algorithm (emphasis added):
To percent decode a byte sequence input, run these steps:

*Using anything but a utf-8 decoder when the input contains bytes outside the range 0x00 to 0x7F might be insecure and is not recommended.*

1. Let output be an empty byte sequence.
2. For each byte byte in input, run these steps:
   1. If byte is not '%', append byte to output.
   2. Otherwise, if byte is '%' and the next two bytes after byte in input are not in the ranges 0x30 to 0x39, 0x41 to 0x46, and 0x61 to 0x66, append byte to output.
   3. Otherwise, run these substeps:
      1. Let bytePoint be the two bytes after byte in input, **decoded**, and then interpreted as hexadecimal number.
      2. Append a byte whose value is bytePoint to output.
      3. Skip the next two bytes in input.
3. Return output.
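To make my reading of the algorithm concrete, here is a sketch of it in Python (the function name and structure are my own, not from the spec; the hex-digit check from substep 2 is made explicit):

```python
def percent_decode(input_bytes: bytes) -> bytes:
    """Sketch of the spec's percent-decode steps: copy bytes through
    unless a '%' is followed by two ASCII hex digits, in which case
    emit the single byte those digits encode."""
    HEX = b"0123456789ABCDEFabcdef"  # 0x30-0x39, 0x41-0x46, 0x61-0x66
    output = bytearray()
    i = 0
    while i < len(input_bytes):
        byte = input_bytes[i]
        if byte != 0x25:  # substep 1: not '%', copy through
            output.append(byte)
        elif not (i + 2 < len(input_bytes)
                  and input_bytes[i + 1] in HEX
                  and input_bytes[i + 2] in HEX):
            # substep 2: '%' not followed by two hex digits, copy through
            output.append(byte)
        else:
            # substep 3: interpret the two hex digits as a byte value
            output.append(int(input_bytes[i + 1:i + 3].decode("ascii"), 16))
            i += 2  # skip the two consumed digits
        i += 1
    return bytes(output)
```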
In the original spec, the word "decoded" (in bold above) is a link to a UTF-8 decoding algorithm. I assume this is the "utf-8 decoder" referred to in the second sentence (italicized) above.
I understand that invalid sequences of UTF-8 bytes can cause security problems. However, by the time the decoder is used, the two bytes have already been verified to be valid ASCII hex digits by the preceding substep 2, so using a UTF-8 decoder there for security seems like overkill.
Can anyone explain how using something other than a UTF-8 decoder in this algorithm could possibly be insecure, when the decoder will only be used for byte values in the ranges 0x30 to 0x39, 0x41 to 0x46, and 0x61 to 0x66? Or am I interpreting something incorrectly in the spec?
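To illustrate why I think the decoder choice cannot matter at that point, here is a quick check in Python: for pure-ASCII bytes such as hex digits, a UTF-8 decode is byte-for-byte identical to any other ASCII-compatible decode.

```python
# All byte values the spec's substep 2 allows through to the decode:
# 0x30-0x39 ('0'-'9'), 0x41-0x46 ('A'-'F'), 0x61-0x66 ('a'-'f').
hex_digits = bytes(range(0x30, 0x3A)) + bytes(range(0x41, 0x47)) + bytes(range(0x61, 0x67))

# UTF-8 is a superset of ASCII, so for these bytes every
# ASCII-compatible decoder produces the same result:
assert hex_digits.decode("utf-8") == hex_digits.decode("latin-1") == hex_digits.decode("ascii")
print(hex_digits.decode("utf-8"))  # 0123456789ABCDEFabcdef
```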
It seems to me that any bytes outside the range 0x00 to 0x7F will simply be copied to output as-is (either in substep 1 because they are not '%', or in substep 2 because they are not ASCII hex digits), so they will never reach a decoder in this algorithm.
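That also matches what an existing implementation does; for example, Python's stdlib percent-decoder copies a raw high byte like 0xC3 straight to the output and leaves a '%' followed by non-hex digits alone:

```python
from urllib.parse import unquote_to_bytes

# 0xC3 is outside 0x00-0x7F and is not '%', so it is copied through
# unchanged; '%zz' fails the hex-digit check and is also copied
# through; only '%41' is actually decoded:
print(unquote_to_bytes(b"\xc3%zz%41"))  # b'\xc3%zzA'
```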