Derive protocol from url

Question

I have a list of URLs such as ["www.bol.com ","www.dopper.com"]. To use them as start URLs in Scrapy, I need to know the correct HTTP protocol for each.

For example:

["https://www.bol.com/nl/nl/", "https://dopper.com/nl"]

As you can see, the protocol might be https or http, and the host may or may not include www.

Not sure if there are any other variations.

Is there any Python tool that can determine the right protocol?
If not, and I have to build the logic myself, which cases should I take into account?

For option 2, this is what I have so far:

import requests

def identify_protocol(url):
    url = url.strip()
    # Candidates in order of preference: https, then http, then https without "www."
    candidates = [
        "https://" + url + "/",
        "http://" + url + "/",
        "https://" + url.replace("www.", "") + "/",
    ]
    for candidate in candidates:
        try:
            r = requests.get(candidate, timeout=10)
            return r.url, r.status_code
        except requests.RequestException:
            # Connection errors, timeouts, etc.: move on to the next candidate
            continue
    return None, None

Is there any other possibility I should take into account?

If you don't *know*, all you can do is try and handle failed requests. You should probably default to HTTP, and secure services will *redirect* your request to HTTPS. Whether "www" can be included or can't be included or is optional will depend on the service; why would you have "www.dopper.com" as a start address when it must not contain "www"? — deceze, Oct 08 '21 at 10:03
I do not want to handle failed requests; I need to scrape the website, so I need to find the correct protocol to do so. I do not understand your question. — A.Papa, Oct 11 '21 at 13:16

score 2 · Answer 1 · answered Oct 08 '21 at 10:06

There is no way to determine the protocol/full domain from the fragment directly; the information simply isn't there. To find it, you would need either:

1. a database of the correct protocols/domains, in which you can look up your domain fragment, or
2. to make the request and see what the server tells you.

If you do (2) you can of course gradually build your own database to avoid needing the request in future.
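
The "build your own database" idea could be sketched roughly as below. This is only an illustration: the function names, the JSON file format and the `resolver` callable (whatever function actually performs the request) are assumptions, not part of the original answer.

```python
import json
from pathlib import Path

def load_cache(path):
    """Return the saved fragment -> final URL mapping, or an empty dict."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}

def save_cache(path, cache):
    """Write the mapping back to disk as JSON."""
    Path(path).write_text(json.dumps(cache, indent=2))

def resolve_with_cache(fragment, cache, resolver):
    """Look the fragment up first; call resolver only on a cache miss."""
    key = fragment.strip().lower()
    if key not in cache:
        cache[key] = resolver(key)  # hypothetical: performs the real request
    return cache[key]
```

With this shape, a second lookup of the same fragment never touches the network, and the mapping survives between runs if you call `save_cache` at the end.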

On many HTTPS servers, an HTTP request is redirected to HTTPS. If you are not redirected and the request succeeds, you can use HTTP. If the HTTP request fails, try again with HTTPS and see whether it works.

The same applies to the domain: if the site usually redirects, you can perform the request using the original domain and see where you are redirected.
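
The "try http first, fall back to https" order described above might look something like this using only the standard library. The names `open_url` and `resolve` are made up for this sketch, and treating every `OSError` (including HTTP error statuses such as 404) as "try the next scheme" is a simplifying assumption you may want to refine.

```python
import urllib.request

def open_url(url, timeout=10):
    """Fetch url and return the final URL after any redirects."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.url

def resolve(host, fetch=open_url):
    """Try plain http first, then https; return the final URL or None."""
    host = host.strip()
    for scheme in ("http", "https"):
        try:
            return fetch(f"{scheme}://{host}/")
        except OSError:
            # URLError, HTTPError, timeouts and connection errors
            # all derive from OSError; move on to the next scheme.
            continue
    return None
```

The `fetch` parameter exists only so the lookup can be swapped out (for example for a stub in tests); calling `resolve("www.dopper.com")` uses the real network.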

An example using requests:

>>> import requests
>>> r = requests.get('http://bol.com')
>>> r
<Response [200]>
>>> r.url
'https://www.bol.com/nl/nl/'

As you can see, the response object's url attribute holds the final destination URL, including the protocol.
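
If you want the protocol and the "www" question answered separately rather than as one string, the final URL can be taken apart with `urllib.parse`. A small sketch; `describe_final_url` is a hypothetical helper name, and the input is the final URL from the example above:

```python
from urllib.parse import urlsplit

def describe_final_url(final_url):
    """Split a resolved URL into scheme, host and a 'uses www' flag."""
    parts = urlsplit(final_url)
    host = parts.hostname or ""
    return {
        "scheme": parts.scheme,
        "host": host,
        "uses_www": host.startswith("www."),
    }

print(describe_final_url("https://www.bol.com/nl/nl/"))
# {'scheme': 'https', 'host': 'www.bol.com', 'uses_www': True}
```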

That is great @mfitzp! I think doing the check beforehand is the way to go. I have now written a simple function with try and except to identify the correct protocol. Have I covered all possible scenarios, or should I consider more? (post updated) — A.Papa, Oct 08 '21 at 10:57

Olvin Roght · Accepted Answer · 2021-10-08T11:59:45.637

As I understand the question, you need to retrieve the final URL after all possible redirections. This can be done with the built-in urllib.request. If the provided URL has no scheme, you can use http as the default. To parse the input URL, I used a combination of urlsplit() and urlunsplit().

Code:

import urllib.request as request
import urllib.parse as parse

def find_redirect_location(url, proxy=None):
    parsed_url = parse.urlsplit(url.strip())
    url = parse.urlunsplit((
        parsed_url.scheme or "http",
        parsed_url.netloc or parsed_url.path,
        parsed_url.path.rstrip("/") + "/" if parsed_url.netloc else "/",
        parsed_url.query,
        parsed_url.fragment
    ))

    if proxy:
        handler = request.ProxyHandler(dict.fromkeys(("http", "https"), proxy))
        opener = request.build_opener(handler, request.ProxyBasicAuthHandler())
    else:
        opener = request.build_opener()

    with opener.open(url) as response:
        return response.url
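
The `netloc or path` fallback in the function above is there because `urlsplit()` treats a scheme-less string as a path, not as a host. A quick illustration, using inputs from the question (the variable names are only for this demo):

```python
from urllib.parse import urlsplit, urlunsplit

bare = urlsplit("www.dopper.com")
# No "//" in the input, so the host ends up in .path and .netloc is empty.
print((bare.scheme, bare.netloc, bare.path))   # ('', '', 'www.dopper.com')

full = urlsplit("https://google.com")
# With a scheme and "//", the host is recognised as the network location.
print((full.scheme, full.netloc, full.path))   # ('https', 'google.com', '')

# Rebuilding the bare input with a default scheme, as the function does:
rebuilt = urlunsplit(("http", bare.netloc or bare.path, "/", "", ""))
print(rebuilt)                                 # http://www.dopper.com/
```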

Then you can just call this function on every URL in the list:

urls = ["bol.com ","www.dopper.com", "https://google.com"]
final_urls = list(map(find_redirect_location, urls))

You can also use proxies:

from itertools import cycle

urls = ["bol.com ","www.dopper.com", "https://google.com"]
proxies = ["http://localhost:8888"]
final_urls = list(map(find_redirect_location, urls, cycle(proxies)))

To make it a bit faster you can make checks in parallel threads using ThreadPoolExecutor:

from concurrent.futures import ThreadPoolExecutor

urls = ["bol.com ","www.dopper.com", "https://google.com"]
final_urls = list(ThreadPoolExecutor().map(find_redirect_location, urls))
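
One thing to be aware of with the parallel version: `Executor.map` re-raises a worker's exception when its result is retrieved, so a single unreachable site would abort the whole `list(...)` call. A possible workaround is sketched below; `resolve_all`, its `resolver` argument (for instance `find_redirect_location`) and the choice to record failures as `None` are assumptions of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def resolve_all(urls, resolver, max_workers=8):
    """Run resolver over urls in threads; failed lookups become None."""
    def safe(url):
        try:
            return resolver(url)
        except Exception:
            # Unreachable host, timeout, bad URL, ...: keep going.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input urls.
        return dict(zip(urls, pool.map(safe, urls)))
```

Called as `resolve_all(urls, find_redirect_location)`, this returns a dict mapping each input to its final URL, or to `None` where the lookup failed.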
