How can I check whether a URL is valid using `urlparse`?

Question

I want to check whether a URL is valid, before I open it to read data.

I was using the function urlparse from the urlparse package:

if not bool(urlparse.urlparse(url).netloc):
    # do something like: open and read using urllib2

However, I noticed that some valid URLs are treated as broken, for example:

url = "upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"

This URL is valid (I can open it using my browser).

Is there a better way to check if the URL is valid?

@xfx But I have a lot of links, and I don't know whether each one starts with http:// or not, or whether it is a valid URL or not. I want to write a function that tells me this and avoids these types of mistakes. — Ziva, Aug 12 '14 at 08:11
If you're going to open it with urllib2 anyway, can't you just open it first and check if the return code equals 200? — Dunno, Aug 12 '14 at 08:12
If it's mainly the http:// that's the issue, `if(url[:7] != 'http://'):`...`url = 'http://' + url` — flau, Aug 12 '14 at 08:17

John Paraskevopoulos · Answer 1 · 2020-02-03T10:35:22.440

TL;DR: You can't, really. Every answer already given misses one or more cases.

1. The string is google.com (invalid since it has no scheme, even though a browser assumes http by default). urlparse returns an empty scheme and an empty netloc, so all([result.scheme, result.netloc, result.path]) rejects it.
2. The string is http://google (invalid since .com is missing). urlparse returns only an empty path, so all([result.scheme, result.netloc, result.path]) happens to reject this case too.
3. The string is http://google.com/ (valid). urlparse populates scheme, netloc and path, so all([result.scheme, result.netloc, result.path]) accepts it.
4. The string is http://google.com (valid). urlparse returns only an empty path, so all([result.scheme, result.netloc, result.path]) gives a false negative.

From the cases above, the check that comes closest to a solution is all([result.scheme, result.netloc, result.path]). But it only works when the URL contains a path (even if that is just the / path).
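The four cases above can be reproduced with a short script. This is a minimal sketch assuming Python 3 (`urllib.parse`); the helper name `has_all_parts` is made up for illustration.

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse


def has_all_parts(url):
    """Hypothetical helper wrapping the all([...]) check discussed above."""
    result = urlparse(url)
    return all([result.scheme, result.netloc, result.path])


# Print scheme, netloc and path for each case, then the result of the check.
for candidate in ["google.com", "http://google", "http://google.com/", "http://google.com"]:
    print(candidate, urlparse(candidate)[:3], has_all_parts(candidate))
```

Only http://google.com/ passes; http://google.com is rejected because its path is empty.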

Even if you try to enforce a path (i.e. urlparse(urljoin(your_url, "/"))), you will still get a false positive in case 2.

Maybe something more complicated, like:

from urllib.parse import urljoin, urlparse  # Python 2: from urlparse import urljoin, urlparse

final_url = urlparse(urljoin(your_url, "/"))
is_correct = (all([final_url.scheme, final_url.netloc, final_url.path])
              and len(final_url.netloc.split(".")) > 1)

Maybe you also want to skip the scheme check and assume http when no scheme is given. But even this only gets you so far. Although it covers the cases above, it does not fully cover URLs that contain an IP address instead of a hostname. For those you have to validate that the IP address itself is well formed. And there are more scenarios as well. See https://en.wikipedia.org/wiki/URL for even more cases to think about.
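One way to fold the IP case into the check is the standard `ipaddress` module. The following is only a sketch under stated assumptions: Python 3, http and https only, and a hypothetical helper name `looks_like_http_url`. It drops the path requirement (urljoin would add a path anyway) and is a heuristic, not a full URL validator.

```python
import ipaddress
from urllib.parse import urlparse


def looks_like_http_url(url):
    """Hypothetical heuristic: scheme, host, and either a valid IP or a dotted name."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname  # None when there is no netloc
    except ValueError:
        # urlparse raises ValueError for e.g. unbalanced IPv6 brackets
        return False
    # Assumption: only http and https are of interest here
    if parsed.scheme not in ("http", "https") or not host:
        return False
    try:
        ipaddress.ip_address(host)  # accepts IPv4 and IPv6 literals
        return True
    except ValueError:
        pass  # not an IP address, fall through to the hostname check
    labels = host.split(".")
    if all(label.isdigit() for label in labels):
        return False  # looks like an IPv4 address but failed validation
    # Same idea as the netloc.split(".") check: at least two non-empty labels
    return len(labels) > 1 and all(labels)
```

With this sketch, http://google.com and http://192.168.0.1/ pass, while google.com, http://google and http://[::1/ are rejected.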

urljoin and urlparse end up calling urlsplit which may throw a ValueError if there are brackets (IPv6) in what it thinks is the netloc, so exception handling is necessary too — digenishjkl, Oct 05 '21 at 10:12

score 13 · Answer 2 · edited Sep 21 '15 at 17:26

You can check if the url has the scheme:

>>> url = "no.scheme.com/math/12345.png"
>>> parsed_url = urlparse.urlparse(url)
>>> bool(parsed_url.scheme)
False

If the scheme is missing, you can set one with _replace. Note that the host is still parsed as part of the path, so the resulting URL has three slashes after the scheme:

>>> parsed_url.geturl()
'no.scheme.com/math/12345.png'
>>> parsed_url = parsed_url._replace(scheme="http")
>>> parsed_url.geturl()
'http:///no.scheme.com/math/12345.png'
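The three slashes in the output above appear because a string without a scheme is parsed entirely as path, leaving netloc empty. A minimal sketch of one way around that, assuming Python 3 and that a bare input always starts with a host name; `with_default_scheme` is a hypothetical helper, not part of `urllib`:

```python
from urllib.parse import urlparse


def with_default_scheme(url, default="http"):
    """Hypothetical helper: add a scheme to a bare 'host/path' string."""
    parsed = urlparse(url)
    if parsed.scheme and parsed.netloc:
        return url  # already has both a scheme and a host
    # A leading "//" makes urlparse read the first segment as netloc
    # instead of path, so geturl() yields two slashes rather than three.
    return urlparse("//" + url.lstrip("/"), scheme=default).geturl()


print(with_default_scheme("no.scheme.com/math/12345.png"))
```

This avoids the string-level `.replace('///', '//')` fix, at the cost of assuming that every scheme-less input is meant as a web address.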

answered Aug 12 '14 at 08:24 by xbello · edited Sep 21 '15 at 17:26 by serv-inc

+1 for the trick with replacing the tuple which I find very elegant (and didn't know about). The only problem here is that the returned url contains three slashes after the scheme as the url with no scheme is interpreted as `path` instead of `netloc`. A simple `.replace('///', '//')` does the trick for me at least. – taffit Jul 14 '16 at 09:59
You missed `import urlparse` – alexey_efimov Oct 12 '16 at 07:53
@alexey_efimov, the question already said "I was using the function urlparse from the urlparse package". – xbello Oct 12 '16 at 12:06
Alternatively, you can simply use `import urllib.parse; urllib.parse.urlparse(url, scheme='http')` to get the same result. – vicke4 Aug 09 '17 at 21:15

abdullahselek · Answer 3 · 2023-04-06T07:18:10.090

score 5

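You can try the function below, which parses the URL and requires a scheme and a path; the netloc is optional, so that URLs such as file:///some_file.txt pass. It supports both Python 2 and 3.

try:
    # Python 3
    from urllib.parse import urlparse
except ImportError:
    # Python 2
    from urlparse import urlparse

def url_validator(url):
    try:
        result = urlparse(url)
        # scheme and path are always required
        components = [result.scheme, result.path]
        # netloc is only added when present, so file: URLs are not rejected
        if result.netloc != "":
            components.append(result.netloc)
        return all(components)
    except (AttributeError, TypeError, ValueError):
        # malformed or non-string input
        return False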

answered Dec 07 '17 at 11:55 by abdullahselek · edited Apr 06 '23 at 07:18

Fails on a valid URL. `>>> url_validator("file:///some_file.txt") False` – dgrogan Apr 04 '23 at 17:50
made minor changes, you can try again – abdullahselek Apr 06 '23 at 07:18

score 1 · Answer 4 · answered Aug 12 '14 at 08:13

A URL without a scheme is actually invalid; your browser is just clever enough to assume http:// as the scheme for it. A good solution may be to check whether the URL lacks a scheme (not re.match(r'^[a-zA-Z]+://', url)) and, if so, prepend http:// to it.
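A minimal sketch of that idea, assuming Python 3; `ensure_scheme` and `_SCHEME_RE` are hypothetical names. The pattern is slightly wider than the one above because a scheme may also contain digits, "+", "-" and "." after its first letter.

```python
import re

# A scheme is a letter followed by letters, digits, "+", "-" or ".", then "://"
_SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def ensure_scheme(url, default="http"):
    """Hypothetical helper: prepend default:// when url has no scheme."""
    if _SCHEME_RE.match(url):
        return url
    return default + "://" + url


print(ensure_scheme("upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"))
```

This only normalizes the string. Whether the result points at a reachable resource still has to be checked by opening it.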

answered Aug 12 '14 at 08:13 by vil
