That happens because of soft redirects. urllib is not following them because it does not recognize them as redirects at all: the server issues an HTTP response code 200 (page found), and the redirection is carried out as a side effect by the browser.
The first page has an HTTP response code of 200, but contains the following:
<meta http-equiv="refresh" content="1; url=http://fave.co/1idiTuz">
which instructs the browser to follow the link. The second resource issues an HTTP response code 301 or 302 (redirect) to another resource, where a second soft redirect takes place, this time with JavaScript:
<script type="text/javascript">
setTimeout(function () {window.location.replace('http://bhphotovideo.com');}, 2.75 * 1000);
</script>
<noscript>
<meta http-equiv="refresh" content="2.75;http://bhphotovideo.com" />
</noscript>
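You can check this behaviour directly: urlopen() follows the true 301/302 redirect on its own, but treats the meta-refresh page as final. A minimal sketch (getcode() and geturl() are standard methods on the object urlopen() returns):

from urllib import urlopen
from contextlib import closing

# urlopen() transparently follows true HTTP redirects (301/302), but a
# page that redirects only via a <meta> refresh tag looks final to it.
with closing(urlopen("http://thetechshowdown.com/Redirect4.php")) as stream:
    print(stream.getcode())  # 200 -- reported as "page found"
    print(stream.geturl())   # still the original URL, nothing was followed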
Unfortunately, you will have to extract the URLs to follow by hand. However, it's not difficult. Here is the code:
from lxml.html import parse
from urllib import urlopen
from contextlib import closing

def follow(url):
    """Follow both true and soft redirects."""
    while True:
        with closing(urlopen(url)) as stream:
            # urlopen() has already followed any true (301/302) redirects;
            # look for a soft one in the form of a <meta> refresh tag.
            next = parse(stream).xpath("//meta[@http-equiv = 'refresh']/@content")
            if next:
                # The content attribute is "delay; url=..." or "delay;...";
                # keep only the URL part.
                url = next[0].split(";")[1].strip().replace("url=", "")
            else:
                # No soft redirect left -- geturl() is the final address.
                return stream.geturl()

print follow("http://thetechshowdown.com/Redirect4.php")
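That one-liner extraction works for both pages because the two content attributes quoted above differ only in the optional url= prefix; a quick sanity check:

# The two formats seen on the pages above:
for content in ["1; url=http://fave.co/1idiTuz",
                "2.75;http://bhphotovideo.com"]:
    print(content.split(";")[1].strip().replace("url=", ""))
# -> http://fave.co/1idiTuz
# -> http://bhphotovideo.com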
I will leave the error handling to you :) Also note that this might result in an endless loop if the target page contains a <meta> refresh tag too. That is not the case here, but you could add some kind of check to prevent it: stop after n redirects, see whether the page is redirecting to itself, whichever you think is better.
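If you go for the first of those checks, here is a minimal sketch of a bounded variant (follow_bounded and max_hops are just names I picked; the limit of 10 is arbitrary):

from lxml.html import parse
from urllib import urlopen
from contextlib import closing

def follow_bounded(url, max_hops=10):
    """Like follow(), but give up after max_hops soft redirects."""
    for _ in range(max_hops):
        with closing(urlopen(url)) as stream:
            next = parse(stream).xpath("//meta[@http-equiv = 'refresh']/@content")
            if not next:
                return stream.geturl()
            # Same extraction as in follow(), then loop to the new URL.
            url = next[0].split(";")[1].strip().replace("url=", "")
    raise IOError("giving up after %d soft redirects: %s" % (max_hops, url))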
You will probably need to install the lxml library (e.g. with pip install lxml).