I do have a list of urls such as ["www.bol.com ","www.dopper.com"]
format.
In order to be inputted on scrappy as start URLs I need to know the correct HTTP protocol.
For example:
["https://www.bol.com/nl/nl/", "https://dopper.com/nl"]
As you see the protocol might differ from https
to http
or even with or without www.
Not sure if there are any other variations.
- is there any python tool that can determine the right protocol?
- If not and I have to build the logic by myself what are the cases that I should take into account?
For option 2, this is what I have so far:
def identify_protocol(url):
try:
r = requests.get("https://" + url + "/", timeout=10)
return r.url, r.status_code
except requests.HTTPError:
r = requests.get("http//" + url + "/", timeout=10)
return r.url, r.status_code
except requests.HTTPError:
r = requests.get("https//" + url.replace("www.","") + "/", timeout=10)
return r.url, r.status_code
except:
return None, None
is there any other possibility I should take into account?