14

I want to check whether a URL is valid, before I open it to read data.

I was using the function urlparse from the urlparse package:

if bool(urlparse.urlparse(url).netloc):
    # do something like: open and read using urllib2

However, I noticed that some valid URLs are treated as broken, for example:

url = "upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"

This URL is valid (I can open it using my browser).

Is there a better way to check if the URL is valid?

ali_m
Ziva

4 Answers

14

TL;DR: You can't, really. Every answer given so far misses one or more cases.

  1. The string is google.com (invalid because it has no scheme, even though a browser assumes http by default). urlparse returns an empty scheme and an empty netloc, so all([result.scheme, result.netloc, result.path]) correctly rejects it.
  2. The string is http://google (invalid because .com is missing). urlparse returns an empty path only, so all([result.scheme, result.netloc, result.path]) rejects it too, though only because the path is empty.
  3. The string is http://google.com/ (valid). urlparse populates scheme, netloc and path, so all([result.scheme, result.netloc, result.path]) accepts it.
  4. The string is http://google.com (valid). urlparse returns an empty path only, so all([result.scheme, result.netloc, result.path]) gives a false negative.

So, judging from the cases above, the check that comes closest to a solution is all([result.scheme, result.netloc, result.path]). But it only works when the URL contains a path (even if that is just the / path).
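The four cases above can be reproduced with a short script. This is only an illustrative sketch for Python 3; the helper name has_all_parts is made up for this example and does not come from any library.

```python
from urllib.parse import urlparse


def has_all_parts(url):
    """Hypothetical helper: True only if scheme, netloc and path are all non-empty."""
    result = urlparse(url)
    return all([result.scheme, result.netloc, result.path])


# The four strings discussed in the list above.
candidates = [
    "google.com",          # no scheme, so the host is parsed as the path
    "http://google",       # empty path
    "http://google.com/",  # scheme, netloc and path all present
    "http://google.com",   # empty path, so a valid URL is rejected
]

for candidate in candidates:
    print(candidate, has_all_parts(candidate))
```

Only the third string passes, which is why the check on its own produces the false negative described in case 4.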

Even if you try to enforce a path (i.e. urlparse(urljoin(your_url, "/"))), you will still get a false positive in case 2.

Maybe something more complicated like

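from urllib.parse import urljoin, urlparse  # Python 3; in Python 2 both live in the urlparse module

final_url = urlparse(urljoin(your_url, "/"))
# Require scheme, netloc and path, plus at least one dot in the netloc
is_correct = (all([final_url.scheme, final_url.netloc, final_url.path])
              and len(final_url.netloc.split(".")) > 1)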

You may also want to skip the scheme check and assume http when no scheme is given. But even this only gets you so far. Although it covers the cases above, it doesn't fully cover URLs that contain an IP address instead of a hostname; for those you have to validate that the IP address is well-formed. And there are more scenarios as well. See https://en.wikipedia.org/wiki/URL for even more cases to consider.
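One possible way to extend the check to IP addresses is to use the standard ipaddress module, as in the rough sketch below (Python 3). The function name looks_like_url and the rule "a host made only of digits and dots must be a valid IP" are assumptions made for this illustration; it is still a heuristic and will miss other cases.

```python
import ipaddress
from urllib.parse import urlparse


def looks_like_url(url):
    """Hypothetical heuristic: scheme plus a dotted hostname or a well-formed IP."""
    try:
        parts = urlparse(url)
        # .hostname drops the port, any userinfo and the IPv6 brackets.
        host = parts.hostname
    except ValueError:
        # Raised, for example, when IPv6 brackets in the netloc are unbalanced.
        return False

    if not parts.scheme or not host:
        return False

    # Hosts that look numeric (IPv4) or contain a colon (IPv6) must parse as an IP.
    if host.replace(".", "").isdigit() or ":" in host:
        try:
            ipaddress.ip_address(host)
            return True
        except ValueError:
            return False

    # Otherwise require at least one dot, as in the netloc check above.
    return len(host.split(".")) > 1
```

With this sketch, http://192.168.0.1:8080/x and http://[::1]/ are accepted, while http://999.1.1.1/ and http://google are rejected.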

  • 2
    urljoin and urlparse end up calling urlsplit which may throw a ValueError if there are brackets (IPv6) in what it thinks is the netloc, so exception handling is necessary too – digenishjkl Oct 05 '21 at 10:12
13

You can check whether the URL has a scheme:

>>> url = "no.scheme.com/math/12345.png"
>>> parsed_url = urlparse.urlparse(url)
>>> bool(parsed_url.scheme)
False

If the scheme is missing, you can set one and rebuild the URL:

>>> parsed_url.geturl()
'no.scheme.com/math/12345.png'
>>> parsed_url = parsed_url._replace(scheme="http")
>>> parsed_url.geturl()
'http:///no.scheme.com/math/12345.png'
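The three slashes appear because, without a scheme, the host is parsed as part of the path rather than as the netloc. A possible workaround, sketched here for Python 3, is to prefix the string with // before re-parsing it with a default scheme. The helper name with_default_scheme is hypothetical, and inputs such as host:port with no scheme are not handled.

```python
from urllib.parse import urlparse


def with_default_scheme(url, scheme="http"):
    """Hypothetical helper: add a default scheme to a scheme-less URL string."""
    parsed = urlparse(url)
    if not parsed.scheme:
        # A leading "//" makes urlparse treat the first segment as the
        # network location instead of the path.
        parsed = urlparse("//" + url.lstrip("/"), scheme=scheme)
    return parsed.geturl()


print(with_default_scheme("no.scheme.com/math/12345.png"))
```

URLs that already carry a scheme are returned unchanged.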
serv-inc
xbello
  • +1 for the trick with replacing the tuple which I find very elegant (and didn't know about). The only problem here is that the returned url contains three slashes after the scheme as the url with no scheme is interpreted as `path` instead of `netloc`. A simple `.replace('///', '//')` does the trick for me at least. – taffit Jul 14 '16 at 09:59
  • You missed `import urlparse` – alexey_efimov Oct 12 '16 at 07:53
  • @alexey_efimov, the question already said "I was using the function urlparse from the urlparse package". – xbello Oct 12 '16 at 12:06
  • Alternatively, you can simply use `import urllib.parse; urllib.parse.urlparse(url, scheme='http')` to get the same result. – vicke4 Aug 09 '17 at 21:15
5

You can try the function below, which checks the scheme, netloc and path components obtained by parsing the URL. It supports both Python 2 and 3.

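try:
    # Python 3
    from urllib.parse import urlparse
except ImportError:
    # Python 2
    from urlparse import urlparse

def url_validator(url):
    """Return True if the parsed URL has a non-empty scheme and path."""
    try:
        result = urlparse(url)
        # scheme and path must always be non-empty
        components = [result.scheme, result.path]
        # netloc is only appended when non-empty, so an empty netloc is not rejected
        if result.netloc != "":
            components.append(result.netloc)
        return all(components)
    except Exception:
        # urlparse raises ValueError on malformed input (e.g. unbalanced IPv6 brackets)
        return False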
abdullahselek
1

A URL without a scheme is actually invalid; your browser is just clever enough to assume http:// for it. A reasonable solution is to check whether the URL lacks a scheme (not re.match(r'^[a-zA-Z]+://', url)) and, if so, prepend http:// to it.
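A minimal sketch of that idea, using the regular expression from this answer, might look as follows. The names SCHEME_RE and ensure_scheme are made up for the example, and http is assumed as the default scheme.

```python
import re

# Same pattern as in the answer: letters followed by "://" at the start.
SCHEME_RE = re.compile(r'^[a-zA-Z]+://')


def ensure_scheme(url, default="http"):
    """Hypothetical helper: prepend a default scheme when none is present."""
    if not SCHEME_RE.match(url):
        return default + "://" + url
    return url


print(ensure_scheme("upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"))
```

Note that the pattern only allows letters, so a scheme containing digits, + or - would not be recognised and would get http:// prepended anyway.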

vil