Python 3 urllib urlparse URI parsing

Question

I'm a little puzzled and hope somebody can help me =)

The result of Python's urlparse function depends on the scheme specified in the URI.

For example, this call returns '/path;'

urllib.parse.urlparse('some://foo.bar/path;').path

But this call returns '/path'

urllib.parse.urlparse('http://foo.bar/path;').path

As I understand it, the first variant is parsed according to RFC 3986, but the second one is parsed according to RFC 2396. Am I right? And how can I parse any string the way RFC 3986 describes?

RFC 3986 explicitly "excludes those portions of RFC 1738 that defined the specific syntax of individual URL schemes". — Stop harming Monica, Jan 10 '18 at 15:30
I did a little research and found the following:
1. The scheme-specific URL syntax from RFC 1738 was discarded by RFC 2396 (page 2).
2. The part of a URI string that begins with the ';' character is called "param" and is defined in RFC 2396 (page 14).
3. RFC 3986 (page 22) does not support these "params".
4. RFC 1945 (HTTP 1.0) directly defines params with the ';' character.
5. RFC 2616 (HTTP 1.1) appears to support "params", since it adopts this part from RFC 2396.
6. HTTP/2 (RFC 7540) uses RFC 3986.
7. The Apache server treats the ';' character as part of the path.
— Ildar Gafurov, Jan 10 '18 at 17:45
Long story short: explicit is better than implicit. But with urlparse I cannot be sure how it will process my URI. I wish there were a simple way to parse a string as a URI as defined in RFC 3986. — Ildar Gafurov, Jan 10 '18 at 17:51

score 2 · Accepted Answer · answered Jan 10 '18 at 21:11

If you don't want the parameters split from the path, use urlsplit instead of urlparse.

urllib.parse.urlsplit('http://foo.bar/path;')

Output

SplitResult(scheme='http', netloc='foo.bar', path='/path;', query='', fragment='')
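
To make the difference concrete, here is a minimal sketch that runs both functions over the two URLs from the question. The `urls` list and the `paths` dictionary are illustrative names only, and the results in the comments are the ones reported in the question and in the output above.

```python
from urllib.parse import urlparse, urlsplit

# The two URLs from the question: same path, different schemes.
urls = ["http://foo.bar/path;", "some://foo.bar/path;"]

# Map each URL to (path from urlparse, path from urlsplit).
paths = {url: (urlparse(url).path, urlsplit(url).path) for url in urls}

for url, (parsed_path, split_path) in paths.items():
    # urlparse may move the ';...' part into .params, depending on the scheme;
    # urlsplit has no params field, so ';' always stays in .path.
    print(f"{url}: urlparse -> {parsed_path!r}, urlsplit -> {split_path!r}")

# http://foo.bar/path;: urlparse -> '/path', urlsplit -> '/path;'
# some://foo.bar/path;: urlparse -> '/path;', urlsplit -> '/path;'
```

urlsplit returns a result with no params field at all, so its path does not depend on whether the scheme is one that urlparse treats as parameter-aware (see the `uses_params` list in `urllib.parse`).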

Stop harming Monica

