Technically, these are not valid URLs, but they are valid IRIs (Internationalized Resource Identifiers), as defined in RFC 3987.
The way you encode an IRI to a URI is:
- UTF-8 encode the path
- %-encode the resulting UTF-8
For example (taken from the linked Wikipedia article), this IRI:
https://en.wiktionary.org/wiki/Ῥόδος
… maps to this URI:
https://en.wiktionary.org/wiki/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82
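The two steps above, applied to just the path of that Wiktionary IRI, can be sketched with the Python 3 standard library. The variable names are only illustrative, and the Greek word is written with escape sequences so the code points are unambiguous:

```python
from urllib.parse import quote_from_bytes

# Path of the example IRI: /wiki/ followed by U+1FEC U+03CC U+03B4 U+03BF U+03C2.
iri_path = "/wiki/\u1fec\u03cc\u03b4\u03bf\u03c2"

# Step 1: UTF-8 encode the path.
utf8_path = iri_path.encode("utf-8")

# Step 2: %-encode the resulting bytes ("/" is treated as safe by default).
uri_path = quote_from_bytes(utf8_path)

print(uri_path)  # /wiki/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82
```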
I believe requests handles these out of the box (although only pretty recently, and only "partial support" is there until 3.0, and I'm not sure what that means). I'm pretty sure urllib2 in Python 2.7 doesn't, and urllib.request in Python 3.6 probably doesn't either.
At any rate, if your chosen HTTP library doesn't handle IRIs, you can do it manually:
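    import urllib.parse

    def iri_to_uri(iri):
        p = urllib.parse.urlparse(iri)
        # UTF-8 encode the path, then %-encode the resulting bytes
        path = urllib.parse.quote_from_bytes(p.path.encode('utf-8'))
        # Rebuild the 6-tuple with the encoded path in slot 2
        p2 = p[:2] + (path,) + p[3:]
        return urllib.parse.urlunparse(p2)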
There are also a number of third-party libraries to handle IRIs, mostly spun off from other projects like Twisted and Amara. It may be worth searching PyPI for one rather than building it yourself.
Or you may want a higher-level library like hyperlink that handles all of the complicated issues in RFC 3987 (and RFC 3986, the current version of the spec for URIs, which neither requests 2.x nor the Python 3.6 stdlib handles quite right).
If you have to deal with IRIs manually, there's a good chance you also have to deal with IDNs (Internationalized Domain Names, used in place of ASCII domain names) too, even though technically they're unrelated specs. So you probably want to do something like this:
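    import urllib.parse

    def iri_to_uri(iri):
        p = urllib.parse.urlparse(iri)
        # IDNA-encode the domain name to its ASCII form
        netloc = p.netloc.encode('idna').decode('ascii')
        # UTF-8 encode the path, then %-encode the resulting bytes
        path = urllib.parse.quote_from_bytes(p.path.encode('utf-8'))
        # Swap the encoded netloc and path into slots 1 and 2
        p2 = p[:1] + (netloc, path) + p[3:]
        return urllib.parse.urlunparse(p2)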
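To see what the netloc step in that function does on its own, here is a minimal sketch using Python's built-in idna codec. The host name bücher.example is a made-up illustration, not something from the question:

```python
# Hypothetical internationalized host name; \u00fc is "ü".
host = "b\u00fccher.example"

# The built-in 'idna' codec converts each non-ASCII label to its
# "xn--" (Punycode) form and leaves ASCII labels as they are.
host_ascii = host.encode("idna").decode("ascii")

print(host_ascii)  # xn--bcher-kva.example
```

Note that the stdlib codec implements the older IDNA 2003 rules (RFC 3490), so if you need the newer IDNA 2008 behavior, a third-party library is probably the safer choice.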