Problem area
I need to determine whether a particular path segment is valid against RFC2396. The spec says:
path_segments = segment *( "/" segment )
segment = *pchar *( ";" param )
param = *pchar
pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
escaped = "%" hex hex
hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
"a" | "b" | "c" | "d" | "e" | "f"
So, for example, /foo is a valid path segment but /fo?o isn't because of the non-escaped ?. To correct the above example, the path segment should be written as /fo%3Fo.
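The grammar above can be turned into a regular expression fairly directly. The following is only a minimal sketch of the escaped (on-the-wire) check, not a complete URI validator. The class name EscapedSegmentCheck and the method isValidEscaped are hypothetical, and the input is assumed to be a single segment without the leading "/" (the slash is a separator in path_segments, not part of segment).

```java
import java.util.regex.Pattern;

// Hypothetical helper: checks one *escaped* segment against the
// RFC 2396 "segment" production quoted above.
final class EscapedSegmentCheck {

    // pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","
    private static final String PCHAR =
            "(?:[A-Za-z0-9\\-_.!~*'():@&=+$,]|%[0-9A-Fa-f]{2})";

    // segment = *pchar *( ";" param ), where param = *pchar
    private static final Pattern SEGMENT =
            Pattern.compile(PCHAR + "*(?:;" + PCHAR + "*)*");

    static boolean isValidEscaped(String segment) {
        // matches() requires the whole input to fit the production
        return SEGMENT.matcher(segment).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidEscaped("foo"));     // true
        System.out.println(isValidEscaped("fo?o"));    // false: "?" is not a pchar
        System.out.println(isValidEscaped("fo%3Fo"));  // true: "?" is escaped
    }
}
```

Note that the empty string also matches, since both pchar and param may repeat zero times.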
The spec, however, only defines the validity of URIs as they arrive at the server (think: typed into the URL bar).
What I actually need to validate is whether an unescaped path segment is valid. Continuing the above example, /fo?o would be a valid resource, as ? is what you get when unescaping %3F.
This also means that the URL http://foo.com/first/sec%2fond would resolve to two unescaped path segments, /first and /sec/ond. The latter not only has to be treated as a single segment rather than two separate ones, but is also syntactically valid (as an unescaped path segment).
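To illustrate the ordering this implies (split on "/" first, unescape afterwards), here is a rough sketch. The class PathSegments and its methods are hypothetical names. Decoding the escaped bytes as UTF-8 is an assumption, since RFC2396 does not prescribe a character encoding. java.net.URI.getRawPath() is used because it returns the path without unescaping it.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits the raw path on "/" BEFORE unescaping,
// so an escaped slash (%2f) stays inside its segment.
final class PathSegments {

    static List<String> unescapedSegments(String url) {
        String rawPath = URI.create(url).getRawPath(); // still percent-encoded
        List<String> result = new ArrayList<>();
        if (rawPath == null || rawPath.isEmpty()) {
            return result;
        }
        if (rawPath.startsWith("/")) {
            rawPath = rawPath.substring(1); // drop the leading separator
        }
        for (String raw : rawPath.split("/", -1)) {
            result.add(percentDecode(raw));
        }
        return result;
    }

    // Replaces each %XX with the byte it denotes; bytes are then
    // interpreted as UTF-8 (an assumption, see the note above).
    static String percentDecode(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] != '%') {
                out.write(in[i]);
                continue;
            }
            if (i + 2 >= in.length) {
                throw new IllegalArgumentException("Truncated escape in: " + s);
            }
            int hi = Character.digit((char) in[i + 1], 16);
            int lo = Character.digit((char) in[i + 2], 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("Bad escape in: " + s);
            }
            out.write((hi << 4) | lo);
            i += 2; // skip the two hex digits just consumed
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints [first, sec/ond]
        System.out.println(unescapedSegments("http://foo.com/first/sec%2fond"));
    }
}
```

The sketch deliberately avoids java.net.URLDecoder, which implements form decoding and would turn "+" into a space.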
Questions
- do I correctly understand the spec?
- can anyone suggest a Java validator for unescaped path segments?
- can anyone suggest a non-trivial failing case?
- what about characters above U+00FF, can they not be used in path segments? I thought they were supported, at least in domain names.
EDIT: as Mike correctly pointed out, RFC3986 obsoleted RFC2396. Anyway, I believe the new RFC handles more cases than the old one (and doesn't make some path segments illegitimate), hence the same questions apply.