I'm trying to check whether the URLs stored in my DB are still valid links. To do this, I use httplib2 to send a HEAD request, so that I can read the status code without downloading the entire content of the page. I was quite happy with the results.
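Roughly, the check looks like this (simplified sketch; the example URL and timeout value are just placeholders, the real URLs come from the DB):

    import httplib2

    url = "https://example.com"  # placeholder; real URLs come from my DB

    http = httplib2.Http(timeout=10)
    # "HEAD" asks the server for the headers only, so no body is transferred
    response, content = http.request(url, "HEAD")
    print(response.status)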
But then I discovered some cases where the status code returned by a HEAD request is not the same as the one returned by a GET request for the same URL.
So, in case it was a bug in the library, I ran some tests with different libraries (below is my test with the requests lib):
    >>> import requests
    >>> rg = requests.get("https://fr.news.yahoo.com/chemin-dames-l-hommage-personnel-pr%C3%A9sident-121005844.html")
    >>> rh = requests.head("https://fr.news.yahoo.com/chemin-dames-l-hommage-personnel-pr%C3%A9sident-121005844.html")
    >>> print("GET status code:", rg.status_code)
    ('GET status code:', 200)
    >>> print("HEAD status code:", rh.status_code)
    ('HEAD status code:', 404)
But whichever lib I use, I still get different GET and HEAD statuses for the same URL.
So apparently the site maintainer decided not to return the same status code for HEAD and GET requests... which seems legitimate, even if not recommended.
Is there a way to avoid this problem and still find out whether a link is valid, without having to download the entire content of the almost 2 million URLs I need to verify?
I could double-check with a GET request whenever a HEAD request returns an error status code (>= 400), but that feels like a dirty workaround to me.
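The best compromise I've come up with so far is something like the sketch below (function name and details are just illustrative, not tested at scale): HEAD first, then a streamed GET fallback where the body is never read, so only the headers should cross the wire.

    import requests

    def link_is_valid(url, timeout=10):
        """Check a URL with HEAD first, then fall back to a streamed GET.

        With stream=True, requests returns as soon as the response headers
        arrive; the body is only downloaded if .content is accessed, so
        closing the response right away avoids pulling the whole page.
        """
        try:
            head = requests.head(url, allow_redirects=True, timeout=timeout)
            if head.status_code < 400:
                return True
            # Some servers answer HEAD and GET differently, so double-check
            get = requests.get(url, stream=True, allow_redirects=True,
                               timeout=timeout)
            try:
                return get.status_code < 400
            finally:
                get.close()
        except requests.RequestException:
            return False

With stream=True the status code is available as soon as the headers arrive, and closing the response before touching .content means the body is (mostly) never transferred, but it still costs an extra request for every URL where HEAD and GET disagree.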