1

I have a URL string as:

url = "https://foo.bar.com/path/to/aaa.bbb/ccc.ddd;dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent=?&339286293"

When using Python:

from urllib.parse import urlparse

url_obj = urlparse(url)
url_obj.path  # '/path/to/aaa.bbb/ccc.ddd'

When using Ruby:

require 'uri'

url_obj = URI.parse(url)
url_obj.path # "/path/to/aaa.bbb/ccc.ddd;dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent="

I guess Python does not consider `;` to be part of the URL path. Which one is 'correct'?

Zou Googi
  • 1
    According to the RFC it should be allowed ... maybe a bug with urlparse (but it does seem to be available in `url_obj.params`) – Joran Beasley Jul 12 '21 at 04:44
  • 1
    @JoranBeasley AFAIK `;` was recommended (at least at some point in the past) as an alternative to `&` as a query parameter delimiter. That would only apply *after* a `?` though, but perhaps that's behind what Python is doing. – mu is too short Jul 12 '21 at 05:42
  • @muistooshort Understood. The example URL I gave was captured from a browser, and I have to parse and use it, so I have no choice but to deal with it. – Zou Googi Jul 12 '21 at 16:53

2 Answers

6

urlparse splits off everything after the first semicolon in the last path segment as params:

url_obj.path   # '/path/to/aaa.bbb/ccc.ddd'
url_obj.params # 'dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent='
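One detail worth seeing in action: urlparse only looks for params in the last path segment. The sketch below uses a made-up URL (`example.com` and the `x`/`y` parameters are assumptions, not from the question) with semicolons in two segments, to show that only the final one is moved into `params`.

```python
from urllib.parse import urlparse

# Hypothetical URL with ";"-parameters on two different path segments.
multi = urlparse("https://example.com/a;x=1/b;y=2")

# Only the parameters of the last segment are split off;
# the ";x=1" on the first segment stays inside the path.
print(multi.path)    # /a;x=1/b
print(multi.params)  # y=2
```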

To replicate Ruby's behaviour, use urlsplit instead:

This is similar to urlparse(), but does not split the params from the URL. This should generally be used instead of urlparse() if the more recent URL syntax allowing parameters to be applied to each segment of the path portion of the URL (see RFC 2396) is wanted.

from urllib.parse import urlsplit

url_obj = urlsplit(url)
url_obj.path  # '/path/to/aaa.bbb/ccc.ddd;dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent='
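As a quick side-by-side check, the following sketch runs both functions on the URL from the question. The variable names `parsed`, `split` and `rejoined` are only illustrative; the point is that, for this URL, gluing urlparse's `path` and `params` back together with a `;` gives the same string that urlsplit (and Ruby) report as the path.

```python
from urllib.parse import urlparse, urlsplit

url = "https://foo.bar.com/path/to/aaa.bbb/ccc.ddd;dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent=?&339286293"

parsed = urlparse(url)  # splits params off the last path segment
split = urlsplit(url)   # leaves the path untouched

# Rejoin urlparse's path and params to recover the full path.
rejoined = parsed.path + (";" + parsed.params if parsed.params else "")

print(rejoined == split.path)       # True
print(parsed.query == split.query)  # True, both are '&339286293'
```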
Amadan
  • Thanks Amadan. I am translating a Python script to a Ruby equivalent, so I guess we found a 'bug' in the original Python script, and the Ruby version does the correct thing. At least this helps to fix that. – Zou Googi Jul 12 '21 at 16:46
2

Python's urllib is wrong. RFC 3986 Uniform Resource Identifier (URI): Generic Syntax, Section 3.3 Path explicitly gives this exact syntax as an example for a valid path [bold emphasis mine]:

Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment to delimit scheme-specific or dereference-handler-specific subcomponents. For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same. Parameter types may be defined by scheme-specific semantics, but in most cases the syntax of a parameter is specific to the implementation of the URI's dereferencing algorithm.

The correct interpretation of the example URI you posted is the following:

  • scheme = https
  • authority = foo.bar.com
    • userinfo = empty
    • host = foo.bar.com
    • port = empty, derived from the scheme to be 443
  • path = /path/to/aaa.bbb/ccc.ddd;dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent=, consisting of the following four path segments:
    1. path
    2. to
    3. aaa.bbb
    4. ccc.ddd;dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent=
  • query = &339286293
  • fragment = empty
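The breakdown above can be reproduced in Python with urlsplit. This is only a sketch: the `DEFAULT_PORTS` table and the way the last segment is split into a name and a parameter dict are assumptions made for illustration, since RFC 3986 treats the segment as opaque and leaves parameter syntax to the application.

```python
from urllib.parse import urlsplit

url = "https://foo.bar.com/path/to/aaa.bbb/ccc.ddd;dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent=?&339286293"
parts = urlsplit(url)

# Hypothetical lookup table: urlsplit reports port as None when the URL omits it.
DEFAULT_PORTS = {"http": 80, "https": 443}
port = parts.port or DEFAULT_PORTS.get(parts.scheme)

# Drop the empty string before the leading "/" to get the path segments.
segments = parts.path.split("/")[1:]

# Application-level convention (an assumption): ";" separates parameters, "=" separates key and value.
name, *raw_params = segments[-1].split(";")
params = dict(p.partition("=")[::2] for p in raw_params)

print(parts.scheme, parts.hostname, port)  # https foo.bar.com 443
print(segments[:3], name)                  # ['path', 'to', 'aaa.bbb'] ccc.ddd
print(params)
print(repr(parts.query), repr(parts.fragment))  # '&339286293' ''
```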
Jörg W Mittag
  • 1
    Not really wrong. `urlparse` is being "helpful", by splitting parameters (per quoted RFC) of the last path segment into its own field. There is a function in `urllib` that does not do this, so I'd rather say `urlparse` is odd, than `urllib` as a package is wrong. – Amadan Jul 12 '21 at 08:34
  • Thanks for the detail. For now I will just follow the parse result from Ruby and ignore the Python version. – Zou Googi Jul 12 '21 at 16:41