
Problem area

I need to determine whether a particular path segment is valid according to RFC 2396. The spec says:

path_segments = segment *( "/" segment )
segment       = *pchar *( ";" param )
param         = *pchar
pchar         = unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","
unreserved    = alphanum | mark
mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
escaped       = "%" hex hex
hex           = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                        "a" | "b" | "c" | "d" | "e" | "f"

So, for example, /foo is a valid path segment but /fo?o isn't because of non-escaped ?. To correct the above example, the path segment should be written as /fo%3Fo.
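The grammar above can be transcribed fairly directly into a regular expression. The following is only a sketch of a check for a single escaped segment (without the leading slash); the class name `Rfc2396Segment` and the method `isValidEscaped` are made up for illustration and do not come from any library.

```java
import java.util.regex.Pattern;

// Hypothetical helper: checks one *escaped* segment against the RFC 2396 grammar.
final class Rfc2396Segment {
    // pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","
    private static final String PCHAR =
            "(?:[A-Za-z0-9\\-_.!~*'():@&=+$,]|%[0-9A-Fa-f]{2})";

    // segment = *pchar *( ";" param ), where param = *pchar
    private static final Pattern SEGMENT =
            Pattern.compile(PCHAR + "*(?:;" + PCHAR + "*)*");

    static boolean isValidEscaped(String segment) {
        return SEGMENT.matcher(segment).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidEscaped("foo"));     // true
        System.out.println(isValidEscaped("fo?o"));    // false: "?" is not a pchar
        System.out.println(isValidEscaped("fo%3Fo"));  // true: "?" is escaped
    }
}
```

Note that the empty string matches, because the grammar allows a segment with zero characters.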

The spec, however, only defines the validity of URIs as they arrive at the server (think: typed into the URL bar).

What I actually need to validate is whether an unescaped path segment is valid. Continuing the example above, /fo?o would be a valid resource, as ? is what you get when unescaping %3F.

This also means that URL http://foo.com/first/sec%2fond would resolve to two unescaped path segments, /first and /sec/ond, and the latter not only has to be treated as a single segment rather than two separate ones but is also syntactically valid (as an unescaped path segment).
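To make the segment boundaries concrete, here is a rough sketch that splits the raw (still escaped) path on "/" first and only then unescapes each piece. `SegmentSplitter` and its methods are hypothetical names, and the decoder assumes the escaped octets are UTF-8, which the RFC itself does not guarantee.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split on "/" BEFORE unescaping, so %2F stays inside a segment.
final class SegmentSplitter {
    // Assumes a hierarchical URI (getRawPath() is null for opaque URIs such as mailto:).
    static List<String> unescapedSegments(String uri) {
        String rawPath = URI.create(uri).getRawPath();
        List<String> segments = new ArrayList<>();
        for (String raw : rawPath.split("/", -1)) {
            segments.add(percentDecode(raw));
        }
        return segments;
    }

    // Assumption: input is ASCII and the escaped bytes form UTF-8 text.
    static String percentDecode(String raw) {
        if (raw.chars().anyMatch(ch -> ch > 0x7F)) {
            throw new IllegalArgumentException("non-ASCII input: " + raw);
        }
        byte[] buf = new byte[raw.length()];
        int n = 0;
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c != '%') {
                buf[n++] = (byte) c;
                continue;
            }
            if (i + 2 >= raw.length()) {
                throw new IllegalArgumentException("truncated escape in: " + raw);
            }
            int hi = Character.digit(raw.charAt(i + 1), 16);
            int lo = Character.digit(raw.charAt(i + 2), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("bad escape in: " + raw);
            }
            buf[n++] = (byte) ((hi << 4) | lo);
            i += 2; // skip the two hex digits
        }
        return new String(buf, 0, n, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The leading "/" yields an empty first segment.
        System.out.println(unescapedSegments("http://foo.com/first/sec%2fond"));
    }
}
```

Because the path starts with "/", the list begins with an empty segment, followed by `first` and `sec/ond`.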

Questions

  • do I correctly understand the spec?
  • can anyone suggest a Java validator for unescaped path segments?
  • can anyone suggest a non-trivial failing case?
  • how about characters above U+00FF, can they not be used in path segments? I thought they were supported, at least in the domain names.

EDIT: as Mike correctly pointed out, RFC3986 obsoleted RFC2396. Anyway, I believe the new RFC handles more cases than the old (and doesn't make some path segments illegitimate) hence the same questions apply.

mindas
  • Are you sure you want to be using RFC 2396 and not RFC 3986 which obsoletes 2396? From RFC 3986 : "Obsoletes: 2732, 2396, 1808" – Mike Samuel Mar 30 '11 at 17:08
  • You’re wrong; `/first/sec%2fond` would result in three segments: an empty segment, `first`, and `sec/ond`. But why would you consider `sec/ond` not being a valid segment? Not valid for what? For being interpreted as a file (regular file or directory) in a file system? – Gumbo Mar 30 '11 at 17:15
  • @Gumbo - think of a CMS where different path segments resolve to different CMS entities. I need to validate whether a user-entered path segment is syntactically correct. – mindas Mar 30 '11 at 17:24
  • @mindas: Then what do you consider the correct syntax for path segments? – Gumbo Mar 30 '11 at 17:32
  • @Gumbo - see above, I have given at least three examples (hence also my question about _incorrect_, or failing, ones) – mindas Mar 30 '11 at 17:34
  • @mindas: I guess that unless you can clearly say what these segments are used for, any suggestion would be vague (unless you only allow a strict set such as alphanumeric characters). – Gumbo Mar 30 '11 at 17:40
  • @Gumbo, as I said, each path segment is mapped to a resource in the CMS. You can think of it as a "page" entity, if that makes it easier to understand. – mindas Mar 30 '11 at 17:43
  • @mindas: And how are these resources or entities identified? What’s wrong with identifying it with `sec/ond`? – Gumbo Mar 30 '11 at 17:53
  • @Gumbo - absolutely nothing. What I want to know is whether there exist some resources that, according to the RFC, cannot be the result of unescaping. – mindas Mar 30 '11 at 21:03

2 Answers


I would interpret the specification in the same way you do; that is, sec%2Fond is a single path segment. (But—anyone who creates a URI with a segment like that should be punished severely!)

The problem you are wrestling with is that the un-escaping process is lossy; you cannot round-trip from an escaped URI to an un-escaped String and back to the original escaped URI. There's no way around this; you have to get hold of the escaped URI before any "helpful" processing discards that critical information.
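As a small illustration of that loss, the sketch below uses `java.net.URI`, whose `getPath()` returns the decoded path and `getRawPath()` the path as written. The wrapper class `LossyUnescape` and the example URIs are only there for demonstration.

```java
import java.net.URI;

// Illustration only: two different escaped URIs collapse to the same decoded path.
final class LossyUnescape {
    static String decodedPath(String uri) {
        return URI.create(uri).getPath();     // escaped octets are decoded
    }

    static String rawPath(String uri) {
        return URI.create(uri).getRawPath();  // escapes are left untouched
    }

    public static void main(String[] args) {
        String a = "http://foo.com/first/sec%2Fond";
        String b = "http://foo.com/first/sec/ond";
        // Same decoded string, so the original segment boundary is gone.
        System.out.println(decodedPath(a).equals(decodedPath(b)));
        // The raw forms still differ, so validation and splitting should start here.
        System.out.println(rawPath(a).equals(rawPath(b)));
    }
}
```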

You can read §2.1 for details on the handling of non-ASCII characters, but my understanding is that the escaping rules in RFC 2396 apply to an octet string (bytes) after the URI character string has been character-encoded. How the character encoding is performed may be specified by the scheme; there is no general method.

erickson
  • Finding the correct resource to serve after lossy unescaping is a different problem. I did not want to pollute this question with it; this one only deals with the validation part :) – mindas Mar 30 '11 at 17:21

So, for example, /foo is a valid path segment but /fo?o isn't because of non-escaped ?. To correct the above example, the path segment should be written as /fo%3Fo.

Correct

This also means that URL http://foo.com/first/sec%2fond would resolve to two unescaped path segments, /first and /sec/ond, and the latter not only has to be treated as a single segment rather than two separate ones but is also syntactically valid (as an unescaped path segment).

Correct. There are many implementations that get this wrong though.

how about characters above U+00FF, can they not be used in path segments? I thought they were supported, at least in the domain names.

URI escapes (% hex hex) encode bytes. Not code-points. You need to know the encoding of the URL. For example, if the encoding is UTF-8, then codepoint U+1234 is encoded as %E1%88%B4.
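A minimal sketch of that byte-level encoding, assuming UTF-8 as the charset. The class `PercentEncoder` is hypothetical, and for simplicity it escapes every byte, including ASCII letters that would not normally need escaping.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: percent-escapes EVERY byte of the UTF-8 encoding of a string.
final class PercentEncoder {
    static String encodeAllBytes(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so negative bytes format as two hex digits.
            sb.append('%').append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeAllBytes("\u1234")); // %E1%88%B4
    }
}
```

With a different charset the same code point would produce different escapes, which is why the encoding has to be known in advance.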

Percent escapes aren't allowed in domain names. For international domain names see RFC 3492.
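For the host part, Java ships `java.net.IDN`, which converts a Unicode domain name to its ASCII (Punycode-based) form. The snippet below is just an illustration with a made-up host name; the wrapper class `IdnExample` is hypothetical.

```java
import java.net.IDN;

// Illustration only: non-ASCII host labels are converted with Punycode, not percent escapes.
final class IdnExample {
    static String toAsciiHost(String unicodeHost) {
        return IDN.toASCII(unicodeHost);
    }

    public static void main(String[] args) {
        // "b\u00FCcher" is "bücher"; the converted label starts with "xn--".
        System.out.println(toAsciiHost("b\u00FCcher.example"));
    }
}
```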

Mike Samuel