I am trying to use urlparse from Python's urllib.parse module to parse some custom URIs.

I noticed that for some well-known schemes, params are parsed correctly:

>>> from urllib.parse import urlparse
>>> urlparse("http://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='http', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
>>> urlparse("ftp://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='ftp', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')

...but for custom schemes they are not. The params field remains empty, and the params are instead treated as part of the path:

>>> urlparse("scheme://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='scheme', netloc='some.domain', path='/some/nested/endpoint;param1=value1;param2=othervalue2', params='', query='query1=val1&query2=val2', fragment='fragment')

Why is there a difference in parsing depending on the scheme? How can I get urlparse to parse params when using a custom scheme?

Konrad Sikorski

2 Answers

Can you remove the custom scheme from the URL? With an empty scheme, urlparse always returns the params:

>>> urlparse("//some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
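
If you still need the scheme in the result, one way to apply this idea is to cut the scheme off by hand, parse the rest, and put the scheme back afterwards. This is only a rough sketch: the helper name parse_with_params is made up for illustration, and it assumes the URI always starts with a scheme followed by a colon.

```python
from urllib.parse import urlparse


def parse_with_params(uri):
    """Parse a URI with a custom scheme so that ';params' are split out.

    Hypothetical helper; assumes the first ':' in `uri` ends the scheme.
    """
    scheme, sep, rest = uri.partition(":")
    if not sep:
        # No colon at all: nothing to strip, parse as-is.
        return urlparse(uri)
    # With an empty scheme, urlparse splits the params from the path.
    parsed = urlparse(rest)
    # ParseResult is a namedtuple, so _replace returns a copy with the
    # original scheme restored.
    return parsed._replace(scheme=scheme)
```

Calling parse_with_params on the "scheme://..." URI from the question should then give the same fields as the http example, with scheme='scheme'.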
Uriel Alves

This is because urlparse assumes that only a fixed set of schemes use parameters in their URL format. You can see that check in the source code:

if scheme in uses_params and ';' in url:
    url, params = _splitparams(url)
else:
    params = ''

This means urlparse attempts to split out parameters only if the scheme is in uses_params, which is a list of known schemes:

uses_params = ['', 'ftp', 'hdl', 'prospero', 'http', 'imap',
               'https', 'shttp', 'rtsp', 'rtspu', 'sip', 'sips',
               'mms', 'sftp', 'tel']

So to get the expected output, you can append your custom scheme to the uses_params list and call urlparse again:

>>> from urllib.parse import uses_params, urlparse
>>>
>>> uses_params.append('scheme')
>>> urlparse("scheme://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='scheme', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
Abdul Niyas P M
  • Thank you for the precise answer. Your solution would do the job, but to be honest I don't feel that modifying library internals that aren't described by the API is a proper design pattern. Potentially, this approach could change the behavior of other libraries that depend on urlparse. I think I'll eventually end up splitting the parameters manually. – Konrad Sikorski Oct 19 '22 at 08:48
  • @KonradSikorski I agree, the `urlparse` API is designed that way! – Abdul Niyas P M Oct 19 '22 at 08:52
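
For the manual splitting mentioned in the comments, a minimal sketch could look like the following. It leaves uses_params untouched and relies on urlsplit, which does not split out params for any scheme. The function name split_params is hypothetical, and the rule that params belong only to the last path segment is taken from how urlparse behaves in the examples above.

```python
from urllib.parse import urlsplit


def split_params(uri):
    """Return (path, params) for `uri`, regardless of its scheme.

    Hypothetical helper; only a ';' in the last path segment starts
    the params, as urlparse does for the schemes it knows.
    """
    path = urlsplit(uri).path
    # Look for ';' only after the last '/', i.e. in the final segment.
    last_slash = path.rfind("/")
    semi = path.find(";", last_slash + 1)
    if semi < 0:
        return path, ""
    return path[:semi], path[semi + 1:]
```

The query and fragment are still available from urlsplit(uri) as usual, so only the path needs this extra step.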