(Context: I'm writing an HTML sanitiser, and want to normalize URLs as a defence-in-depth measure, making it impossible to use abnormally escaped URLs to bypass downstream blacklists (I'm not relying on blacklists myself) or mislead users.)
When given a URL, in what contexts can a character be changed to its percent-encoded version, or vice versa, without changing the meaning of the URL?
What I've been able to conclude so far:
- In the path portion of a URL, `/` is not equivalent to its escaped form `%2F`.
- The separator `?` between the path and query string is not equivalent to its escaped form `%3F` (presumably the same rule also applies to the fragment separator `#`).
- For the special cases of `.` and `..` within a hierarchical path, `.` is equivalent to `%2E` according to the specification.
- Some characters, such as `^`, are illegal in URLs, and thus must only appear in encoded form – the decoded form is not equivalent because it can't be used at all.
- I don't have a second-hand source for this, but all the software I've tested agrees that percent-encoded domain names are equivalent to the corresponding decoded versions (e.g. `ex%61mple.com` is equivalent to `example.com` in the host part of a URL) – this makes sense because `%`, `/`, and illegal-in-URL characters are all illegal in domain names anyway, so escaping could not possibly be of use.
- `%` cannot be equivalent to its encoded form `%25`, otherwise there would be no way to escape the escape character.
- `application/x-www-form-urlencoded` is a commonly (although not universally) used format for URL query strings, and in that format `=`, `+`, and `&` are not equivalent to `%3D`, `%2B`, and `%26` respectively; thus these equivalences cannot hold in URL query strings.
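Putting the rules above together, the one decoding that seems unconditionally safe is undoing escapes whose octet falls in RFC 3986's "unreserved" set (letters, digits, `-`, `.`, `_`, `~`), since the spec says those are equivalent in every component. Here's a sketch of that idea in Python (the function name and structure are mine, not from any standard library):

```python
import re

# RFC 3986 section 2.3 "unreserved" characters: percent-decoding these
# never changes the meaning of a URL, in any component.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def decode_unreserved(url: str) -> str:
    """Decode %XX escapes only when the octet is an unreserved character.

    Every other escape (%2F, %3F, %25, ...) is left encoded, since
    decoding it could change the URL's meaning; its hex digits are
    merely uppercased, which RFC 3986 also treats as meaning-preserving.
    """
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, url)

print(decode_unreserved("http://example.com/ind%65x.html?a=%2bb"))
# %65 -> 'e' is unreserved, so it's decoded; %2b stays encoded
```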
However, it's unclear to me what the correct action is for real-world URLs in other cases, especially as real-life URL parsing libraries tend not to match the specification exactly. In particular:
- Should I be percent-decoding characters in the path portion of a URL that are URL-safe (other than `%`, `/`, `?`, `#`) but have been unexpectedly encoded anyway? The most common software behaviour that I've seen for URLs like `http://example.com/ind%65x.html` is to treat them as distinct URLs from `http://example.com/index.html` (e.g. they appear differently in logs and don't compare as equal), but to actually handle the two "distinct" URLs the same way. I don't know whether this is an implementation detail, or whether it's some sort of compatibility workaround.
- Should I be decoding any characters in query strings? If so, which?
- Should I be decoding any characters in fragments? If so, which?
There seem to be competing standards on this subject, and real-world application behaviour might not match any of them, so I'm interested in knowing how far I can go with URL normalization without breaking real-world use cases. (It would also be helpful to know in which situations escaped characters might be technically different in meaning from the non-escaped versions, but in which escaping them would have no legitimate uses – a sanitiser could have an option to reject URLs that escaped these characters as being likely to be malicious.)
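For what it's worth, whatever the answers to the open questions are, RFC 3986 §6.2.2 does bless two normalizations as always meaning-preserving: lowercasing the scheme and host, and uppercasing the hex digits of percent-escapes. A minimal sketch of just those two steps (assuming the authority contains no userinfo, which a blanket lowercase would mangle):

```python
import re
from urllib.parse import urlsplit, urlunsplit

def _upper_escapes(s: str) -> str:
    # Uppercase the hex digits of every %XX escape, e.g. %3d -> %3D.
    return re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), s)

def case_normalize(url: str) -> str:
    """Apply only RFC 3986's case normalizations, which are safe for
    any URL: scheme and host are case-insensitive, and the hex digits
    of percent-escapes are case-insensitive."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        # Simplification: lowercases the whole netloc, which is only
        # correct when there is no (case-sensitive) userinfo present.
        parts.netloc.lower(),
        _upper_escapes(parts.path),
        _upper_escapes(parts.query),
        _upper_escapes(parts.fragment),
    ))

print(case_normalize("HTTP://EXAMPLE.com/a%2fb?x=%3d#%3f"))
# -> http://example.com/a%2Fb?x=%3D#%3F
```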