Why does java.net.URI not encode all reserved characters? (Java 11)

Question

I want to create an instance of java.net.URI using individual URI components, namely:

scheme
userInfo
host
port
path
query
fragment

There are constructors in the java.net.URI class that allow me to do this. Here is the code of one of them from the library (the five-argument overload, which takes the authority as a single string):

public URI(String scheme,
           String authority,
           String path, String query, String fragment)
    throws URISyntaxException
{
    String s = toString(scheme, null,
                        authority, null, null, -1,
                        path, query, fragment);
    checkPath(s, scheme, path);
    new Parser(s).parse(false);
}

This constructor will also encode path, query, and fragment parts of the URI, so for example if I pass already encoded strings as arguments, they will be double encoded.
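To make the double encoding concrete, here is a minimal sketch. The class name `DoubleEncodingDemo`, its helper methods, and the host `example.com` are made up for illustration. It compares the five-argument constructor, which always quotes the percent character, with `URI.create`, which parses a string that is already encoded.

```java
import java.net.URI;
import java.net.URISyntaxException;

class DoubleEncodingDemo {
    // Builds a URI from components; this constructor quotes '%' itself.
    static String viaComponents(String path) {
        try {
            return new URI("http", "example.com", path, null, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Parses a complete, already-encoded URI string; existing escapes are kept.
    static String viaSingleString(String encodedUri) {
        return URI.create(encodedUri).toString();
    }

    public static void main(String[] args) {
        // A raw space is quoted once: http://example.com/a%20b
        System.out.println(viaComponents("/a b"));
        // An existing escape is quoted again: http://example.com/a%2520b
        System.out.println(viaComponents("/a%20b"));
        // The single-string form leaves it alone: http://example.com/a%20b
        System.out.println(viaSingleString("http://example.com/a%20b"));
    }
}
```

So the component constructors expect decoded input, while `URI.create` and the single-argument constructor expect input that is already encoded.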

JavaDoc on this function states:

If a path is given then it is appended. Any character not in the unreserved, punct, escaped, or other categories, and not equal to the slash character ('/') or the commercial-at character ('@'), is quoted.
If a query is given then a question-mark character ('?') is appended, followed by the query. Any character that is not a legal URI character is quoted.
Finally, if a fragment is given then a hash character ('#') is appended, followed by the fragment. Any character that is not a legal URI character is quoted.

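It states that unreserved, punct, and escaped characters are NOT quoted. The JavaDoc defines punct as the characters in the string ",;:$&+=" (the characters ! ' ( ) * are also left alone, but they belong to its unreserved category):

,
;
:
$
&
+
=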
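The quoting rules quoted from the JavaDoc can be observed directly. The following is only a sketch: the class name `QuotingRulesDemo`, the helper `build`, and the sample strings are invented for illustration. It shows that a question mark is quoted in the path but kept in the query, while a space is quoted everywhere.

```java
import java.net.URI;
import java.net.URISyntaxException;

class QuotingRulesDemo {
    // Wraps the five-argument constructor so callers need no checked exception.
    static URI build(String path, String query, String fragment) {
        try {
            return new URI("http", "example.com", path, query, fragment);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI u = build("/a b?c", "x y?z", "s 1");
        // '?' is not allowed in a path, so it is quoted: /a%20b%3Fc
        System.out.println(u.getRawPath());
        // '?' is a legal URI character in a query, so it stays: x%20y?z
        System.out.println(u.getRawQuery());
        // Only the space needs quoting here: s%201
        System.out.println(u.getRawFragment());
    }
}
```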

According to RFC 3986 reserved characters are:

  reserved    = gen-delims / sub-delims

  gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

So, if the characters @, / and + are reserved and should always be encoded (or am I missing something?) according to the most up-to-date RFC on URIs, then why does the java.net.URI JavaDoc state that it will not encode punct characters (which include + and =), @ and /?

Here is a little example I ran:

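import java.net.URI;

String scheme = "http";
String userInfo = "username:password";
String host = "example.com";
int port = 80;
String path = "/path/t+/resource";
String query = "q=search+term";
String fragment = "section1";

// Seven-argument constructor; it throws URISyntaxException on invalid input.
URI uri = new URI(scheme, userInfo, host, port, path, query, fragment);

// The `+` in the path is not encoded:
// http://username:password@example.com:80/path/t+/resource?q=search+term#section1
System.out.println(uri.toString());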

I don't understand. If this is correct behavior and those characters indeed don't need to be encoded, then why does the RFC call them "reserved"? I'm trying to implement a function that takes a whole URI string and encodes it (that is, extracts the path, query, and fragment, encodes the reserved characters in them, and puts the URI back together).

Sweeper · Answer 1 · 2023-01-15T05:16:18.410

It is precisely because these characters are reserved that Java's API does not encode them. Being reserved means that they have a special meaning when they are not escaped.

From the same section of the RFC you linked:

The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI.
URIs that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent. Percent-
encoding a reserved character, or decoding a percent-encoded octet
that corresponds to a reserved character, will change how the URI is
interpreted by most applications. Thus, characters in the reserved
set are protected from normalization and are therefore safe to be
used by scheme-specific and producer-specific algorithms for
delimiting data subcomponents within a URI.

If java.net.URI always escaped them, then you would not be able to express whatever special meaning the reserved characters have. You would only be able to create

http://username:password@example.com:80/path/t%2B/resource?q=search+term#section1

but not

http://username:password@example.com:80/path/t+/resource?q=search+term#section1

which can be URIs that mean different things, according to the RFC.
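A small sketch shows that java.net.URI itself treats these two as different URIs, even though their decoded paths look identical. The class name `ReservedEquivalenceDemo` and its constants are made up for illustration.

```java
import java.net.URI;

class ReservedEquivalenceDemo {
    static final URI ENCODED = URI.create(
        "http://username:password@example.com:80/path/t%2B/resource?q=search+term#section1");
    static final URI LITERAL = URI.create(
        "http://username:password@example.com:80/path/t+/resource?q=search+term#section1");

    public static void main(String[] args) {
        // equals() compares the raw (still encoded) components: false
        System.out.println(ENCODED.equals(LITERAL));
        // The raw paths differ: /path/t%2B/resource vs /path/t+/resource
        System.out.println(ENCODED.getRawPath());
        System.out.println(LITERAL.getRawPath());
        // Decoding hides the difference: both print /path/t+/resource
        System.out.println(ENCODED.getPath());
        System.out.println(LITERAL.getPath());
    }
}
```

This is why a function that takes a finished URI string cannot decide afterwards whether a given + was meant as a delimiter or as data. That information has to be supplied before the URI is assembled.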

Further down that section, it is also said that

URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component.

In other words, if "these characters are specifically allowed by the URI scheme to represent data in that component", then "URI producing applications should NOT percent-encode data octets...". This is very much the case in the path component, which uses a subset of the reserved characters - /, @, :, and everything in "sub-delims".

This matches what the JavaDoc says about what it doesn't escape. Note that the wording in the JavaDoc (words like "escaped" and "punct") is actually from an older RFC, RFC 2396. With a bit of careful checking, you can see that they are indeed equivalent in this regard.
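If a value is known to be plain data rather than a delimiter, one option is to percent-encode it yourself before assembling the URI. The sketch below is an assumption about how that could look, not something java.net.URI provides. The class `SegmentEncoder` and its method `encodeSegment` are hypothetical. It conservatively escapes every octet outside the RFC 3986 unreserved set, which also escapes sub-delims that a path would technically allow.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

class SegmentEncoder {
    // RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";
    private static final String HEX = "0123456789ABCDEF";

    // Hypothetical helper: percent-encodes one path segment that is pure data.
    static String encodeSegment(String data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80 && UNRESERVED.indexOf((char) v) >= 0) {
                sb.append((char) v);
            } else {
                sb.append('%').append(HEX.charAt(v >> 4)).append(HEX.charAt(v & 0x0F));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String segment = encodeSegment("t+/x y");
        // URI.create expects an already-encoded string, so nothing is re-quoted.
        URI uri = URI.create("http://example.com/path/" + segment + "/resource");
        System.out.println(uri.getRawPath()); // /path/t%2B%2Fx%20y/resource
        System.out.println(uri.getPath());    // /path/t+/x y/resource
    }
}
```

The encoded segment is passed to `URI.create` rather than to a multi-argument constructor, because the latter would quote the percent signs a second time.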

Stephen C · Answer 2 · 2023-01-15T05:18:52.783

So, if the characters @, / and + are reserved and should always be encoded (or am I missing something?) ...

The URI spec states:

"URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component. If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character's encoding in US-ASCII."

It clearly does not say that reserved characters "... should always be encoded".

For example, if we examine the syntax rules for path we see:

  pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

and '+' is in sub-delims.

So + should NOT be (automatically) encoded in a path.
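As a rough check of the pchar rule against java.net.URI, the sketch below builds a path around each sub-delim character plus ":" and "@" and reads back the raw path. The class `PcharDemo`, its constants, and the helper `rawPathFor` are invented for this illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

class PcharDemo {
    static final String SUB_DELIMS = "!$&'()*+,;=";
    static final String PCHAR_EXTRA = ":@";

    // Returns the raw path java.net.URI produces for the path "/a<c>b".
    static String rawPathFor(char c) {
        try {
            return new URI("http", "example.com", "/a" + c + "b", null, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Every character allowed by pchar comes back unchanged.
        for (char c : (SUB_DELIMS + PCHAR_EXTRA).toCharArray()) {
            System.out.println(c + " -> " + rawPathFor(c));
        }
        // gen-delims that would end the path are quoted instead.
        System.out.println("? -> " + rawPathFor('?')); // /a%3Fb
        System.out.println("# -> " + rawPathFor('#')); // /a%23b
    }
}
```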

Just edited that quote into my own answer, and then saw yours, haha. — Sweeper, Jan 15 '23 at 05:17
