(Context: I'm writing an HTML sanitiser, and want to normalize URLs as a defence-in-depth measure, making it impossible to use abnormally escaped URLs to bypass downstream blacklists (I'm not relying on blacklists myself) or mislead users.)

When given a URL, in what contexts can a character be changed to its percent-encoded version, or vice versa, without changing the meaning of the URL?

What I've been able to conclude so far:

  • In the path portion of a URL, / is not equivalent to its escaped form %2F
  • The separator ? between the path and query string is not equivalent to its escaped form %3F (presumably the same rule also applies to the fragment separator #)
  • For the special cases of . and .. within a hierarchical path, . is equivalent to %2E according to the specification
  • Some characters, such as ^, are illegal in URLs, and thus must only appear in encoded form – the decoded form is not equivalent because it can't be used at all
  • I don't have a second-hand source for this, but all the software I've tested agrees that percent-encoded domain names are equivalent to the corresponding decoded versions (e.g. ex%61mple.com is equivalent to example.com in the host part of a URL) – this makes sense because %, /, and illegal-in-URL characters are all illegal in domain names anyway, so escaping could not possibly be of use
  • % cannot be equivalent to its encoded form %25, otherwise there would be no way to escape the escape character
  • application/x-www-form-urlencoded is a commonly (although not universally) used format for URL query strings, and in that format, =, +, & are not equivalent to %3D, %2B, %26 respectively; thus these equivalences cannot hold in URL query strings
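
A minimal Python sketch of the conservative end of these conclusions: decode an escape only when it encodes an RFC 3986 unreserved character (letters, digits, `-`, `.`, `_`, `~`), and leave every other escape alone apart from uppercasing its hex digits. The name `normalize_escapes` is hypothetical, and the function assumes it is given one already-split component (path, query or fragment), not a whole URL. It illustrates the idea and is not a vetted sanitiser.

```python
import re
import string

# RFC 3986 "unreserved" characters: escaping these never changes meaning
# according to the specification.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# Matches one well-formed percent-escape, e.g. %2f or %41.
_PCT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_escapes(component: str) -> str:
    """Decode escapes of unreserved characters; keep all others.

    Escapes that are kept get uppercase hex digits so that %2f and %2F
    compare equal. Each escape is examined exactly once, so %2541 stays
    %2541 and is never double-decoded into "A".
    """
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    return _PCT_ESCAPE.sub(replace, component)


print(normalize_escapes("/ind%65x.html"))  # /index.html
print(normalize_escapes("/a%2fb%3Fc"))     # /a%2Fb%3Fc
print(normalize_escapes("/100%25"))        # /100%25
```

Because `%2E` decodes to `.`, any dot-segment removal has to run after this step, not before it.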

However, I'm finding it unclear what the correct action to take with real-world URLs is in other cases, especially as real-life URL parsing libraries tend not to match the specification exactly. In particular:

  • Should I be percent-decoding characters in the path portion of a URL that are URL-safe (other than %/?#) but have been unexpectedly encoded anyway? The most common software behaviour that I've seen for URLs like http://example.com/ind%65x.html is to treat them as distinct URLs from http://example.com/index.html (e.g. they appear differently in logs and don't compare as equal), but to actually handle the two "distinct" URLs the same way. I don't know whether this is an implementation detail, or whether it's some sort of compatibility workaround.
  • Should I be decoding any characters in query strings? If so, which?
  • Should I be decoding any characters in fragments? If so, which?

There seem to be competing standards on this subject, and real-world application behaviour might not match any of them, so I'm interested in knowing how far I can go with URL normalization without breaking real-world use cases. (It would also be helpful to know in which situations escaped characters might be technically different in meaning from the non-escaped versions, but in which escaping them would have no legitimate uses – a sanitiser could have an option to reject URLs that escaped these characters as being likely to be malicious.)

ais523

1 Answer

I hope this may provide some insight to your question:

We should only encode the individual components of the URL that may contain unsafe symbols (for example, query parameters and fragments), and leave the domain name alone. Note that the different components have different rules about which characters need to be encoded and which do not. See RFC 3986 [https://datatracker.ietf.org/doc/html/rfc3986].

In general, you can follow these rules:

  1. These unreserved characters need not be encoded: ALPHA (uppercase and lowercase) / decimal digits / "-" / "." / "_" / "~"

  2. In application/x-www-form-urlencoded query strings, the space character is converted into a plus sign "+" rather than being percent-encoded.

  3. All other characters (unsafe characters, and reserved characters that are not being used for their reserved purpose) should be encoded. Below is a list of such characters (it is not exhaustive):


    ! * ' ( ) ; : @ & = + $ , / ? # [ ] % { } | \ ^ 
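
As a rough illustration of the three rules, here is a sketch using Python's standard `urllib.parse` module. The helper names `encode_component` and `encode_form_value` are made up for this example, and it assumes Python 3.7 or later, where `~` is left unescaped by `quote`.

```python
from urllib.parse import quote, quote_plus


def encode_component(value: str) -> str:
    """Percent-encode everything except the unreserved characters.

    safe="" means even "/" is escaped, which suits a single path
    segment or query value rather than a whole path.
    """
    return quote(value, safe="")


def encode_form_value(value: str) -> str:
    """Like encode_component, but a space becomes "+" (form encoding)."""
    return quote_plus(value)


print(encode_component("AZaz09-._~"))   # AZaz09-._~  (rule 1: untouched)
print(encode_form_value("a b"))         # a+b         (rule 2: space to +)
print(encode_component("a b"))          # a%20b
print(encode_component("a/b?c#d&e=f"))  # a%2Fb%3Fc%23d%26e%3Df  (rule 3)
```

Note that this encodes data going into a component; it does not decide whether an existing escape in someone else's URL can safely be decoded.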

Akki
  • This is indeed helpful (although it isn't a complete answer, it's a good start), so I've upvoted but not accepted it. I'd been coming to a similar conclusion: if a character has a special meaning in URLs, you need to leave its percent-encoding alone (because the encoded and unencoded versions have different meanings); if it's disallowed in URLs, it must always be percent-encoded; and in other cases, it should be equivalent to the percent-encoded version. There are still some unresolved questions, though, e.g. do these rules still apply in the query and fragment portions? – ais523 Sep 16 '21 at 14:37
  • @ais523, I think these rules apply equally well to the query and fragment. At least, I had a similar requirement recently for the query component and they worked perfectly well for me. Here are the details: https://stackoverflow.com/q/69149676/5685911 – Akki Sep 16 '21 at 18:34