I'm building a script that periodically crawls an online story archive and detects when a story has been deleted. However, I discovered that when a story is deleted, requesting its URL does not return an HTTP 404 response code. Instead, it redirects to a custom "Page not Found" page that returns the 200 OK response code. This means that, contrary to my original plan, I can't just check for a 404.

What is the best way to detect these redirected 404s without getting false positives?

2 Answers

If the server doesn't return the 404 HTTP code (which is bad practice; you should really email the webmaster), there's no simple way to do it.

  • You can keep a list of words or phrases that are likely to appear only on an error page.
    For example "Page not found", "404 error", etc. Search for them in the page title, the <h1> to <h3> tags, and so on.

  • For each domain/website, you can request a URL that doesn't exist (use a long random string, for example 512 bits, so that it almost certainly lands on the 404 error page), then check whether other pages are the same as that response (allowing for some variation).

For example, I'm pretty sure that https://stackoverflow.com/iapbFeq1X33hgg5Dy9zaFUbSnG7 isn't a valid URL. Take the HTML of this page as a reference, and when you check any other page on stackoverflow.com (for example stackoverflow.com/page1), compare its HTML with the reference. If the two are the same or nearly the same, there's a good chance that stackoverflow.com/page1 is a 404 error page too.
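The reference-page comparison could be sketched roughly as follows. This is only an illustration: the helper names `random_probe_url` and `looks_like_error_page`, the line-based comparison, and the 0.9 threshold are all assumptions to tune per site, and fetching the pages is left to whatever HTTP client the crawler already uses.

```python
import secrets
from difflib import SequenceMatcher
from urllib.parse import urljoin


def random_probe_url(base_url: str) -> str:
    """Build a URL on the same site that almost certainly does not exist.

    Hypothetical helper: appends a random 256-bit token as the path.
    """
    return urljoin(base_url, "/" + secrets.token_urlsafe(32))


def looks_like_error_page(html: str, reference_html: str,
                          threshold: float = 0.9) -> bool:
    """Return True when `html` is nearly identical to the reference error page.

    Pages are compared line by line so that small per-request differences
    (timestamps, request IDs) only lower the score slightly. The threshold
    is an assumed starting point, not a recommended value.
    """
    matcher = SequenceMatcher(None, html.splitlines(),
                              reference_html.splitlines())
    return matcher.ratio() >= threshold
```

The idea would be to fetch `random_probe_url(site)` once per site, keep its HTML as the reference, and pass every crawled story page through `looks_like_error_page`.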

Note: for the sake of the example, I assume here that SO returns a 200 code even on its error page, which is of course not true in reality. Check the HTTP status code first; it's easier :)
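The keyword idea from the first bullet might look something like the sketch below, which searches only the title and the h1 to h3 headings so that a story whose body text happens to mention "page not found" is not flagged. The phrase list and the name `has_error_phrase` are assumptions for illustration; a real list would have to be built from the archive's actual error page.

```python
from html.parser import HTMLParser

# Assumed phrases; adjust to whatever the site's error page really says.
ERROR_PHRASES = ("page not found", "404 error")


class _HeadingCollector(HTMLParser):
    """Collect the text found inside <title> and <h1>..<h3> tags."""

    TAGS = {"title", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self._depth = 0      # > 0 while inside one of the watched tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.TAGS and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def has_error_phrase(html: str, phrases=ERROR_PHRASES) -> bool:
    """Return True if the title or a heading contains one of the phrases."""
    collector = _HeadingCollector()
    collector.feed(html)
    collector.close()
    # Lowercase and collapse whitespace before matching.
    text = " ".join(" ".join(collector.chunks).lower().split())
    return any(phrase in text for phrase in phrases)
```

Restricting the search to headings is a design choice aimed at the false-positive concern in the question; it trades some recall for precision.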

Maxime Lorant

Besides parsing the text of the soft 404 page, another way to implement this is to disable redirect following and check whether the status_code is 200. (A redirecting response typically returns 301, 302, or similar.)

Most likely you are seeing a result similar to the following:

import requests
r = requests.get("http://httpbin.org/redirect/1")
r.status_code   # 200, because requests followed the redirect

If, however, you disallow redirects, the request returns the redirect's own status code, such as 301 or 302. You can use the allow_redirects argument to do this.

import requests
r = requests.get("http://httpbin.org/redirect/1", allow_redirects=False)
r.status_code   # 302, the status of the redirect response itself

Please note that this method won't work if the site also uses redirects for legitimate purposes.
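One possible way to soften that limitation is to look at where the redirect points rather than only at the fact that one happened. The sketch below assumes the archive always sends deleted stories to the same error-page URL; `error_page_url` and both function names are hypothetical, and the functions take the status code and headers as plain values so they work with any HTTP client.

```python
from urllib.parse import urljoin

# Status codes that signal a redirect with a Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(request_url, status_code, headers):
    """Return the absolute redirect target, or None if this is not a redirect."""
    if status_code not in REDIRECT_CODES:
        return None
    location = headers.get("Location")
    if not location:
        return None
    # Location may be relative, so resolve it against the requested URL.
    return urljoin(request_url, location)


def is_soft_404_redirect(request_url, status_code, headers, error_page_url):
    """Return True only when the response redirects to the known error page.

    `error_page_url` is an assumption: it could be discovered once by
    requesting a URL that is known not to exist and noting where it leads.
    """
    target = redirect_target(request_url, status_code, headers)
    return target is not None and target == error_page_url
```

With requests, the inputs would come from `r = requests.get(url, allow_redirects=False)`, passing `r.status_code` and `r.headers`. A redirect to anywhere other than the error page (for example, to a canonical story URL) is then not counted as a deletion.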

nerdysu