1

Sorry for the rookie question. I was wondering if there is an efficient url opener class in Python that handles redirects. I'm currently using the simple urllib.urlopen(), but it's not working. This is an example:

http://thetechshowdown.com/Redirect4.php

For this url, the class I'm using does not follow the redirection to:

http://www.bhphotovideo.com/

and only shows:

"You are being automatically redirected to B&H.

Page Stuck? Click Here ."

Thanks in advance.

user3821329
  • Thanks for the 2 answers. However, I encounter some redirections that are not followed by any of methods suggested. For example this one: http://blekko.com/a/serpr?r=4WgZKGhYDCRRBim_2cosrjKEWOf2Fmz8PQlVyYT5iqcD875s_gIIdE5G97qsuVbtj8Ww5gQazWmBdBz-yuGsXgysUqkaMna2-8_3m98RhJ5pR4VlxPttPVBDOpHozJKVUP1zNw7qfiQJlUGQoVQvyrM-K6CTKT6LRPjOi-6cgJxOErFzNpLbcl9pPU6DLkNdy12hN84qqttbQlYXYc49SA&bt=&p=7 ///url finished here, it finally redirects to this link: http://www.zappos.com/mark-nason-boots – user3821329 Jul 17 '14 at 20:11
  • 1
    There is no universal solution. There are three methods of redirection. 1) The server sends an HTTP response with a redirection - `requests` follows this kind. 2) The HTML has a `<meta>` tag with the url - and we showed a solution to this problem. 3) JavaScript can redirect the page - it may use one simple line `window.location.href='http://www.example.com'` or it may hide the url in some way - but `requests` and `urllib` can't run JavaScript. Your first example uses the second method and we made a solution for that method only. – furas Jul 17 '14 at 23:10
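
For the third kind of redirection (JavaScript, as in the blekko link above), neither requests nor urllib executes the script, so the target has to be dug out of the page source. A minimal sketch, assuming the page writes the redirect as a plain window.location assignment (many pages obfuscate it, in which case this finds nothing):

import re
import requests

r = requests.get('http://thetechshowdown.com/Redirect4.php')

# very rough heuristic: grab the first quoted string after a window.location reference
match = re.search(r"window\.location[^'\"]*['\"]([^'\"\\]+)", r.text)
if match:
    print 'javascript redirect to:', match.group(1)
else:
    print 'no obvious javascript redirect found'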

3 Answers

4

Use the requests module - it follows redirections by default.
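
For an ordinary server-side redirect (HTTP 301/302) you don't need anything special; a minimal sketch of the default behaviour (the url is just a well-known example that redirects to its https version):

import requests

r = requests.get('http://github.com')   # the server answers with a 301 to https://github.com/

print r.url        # final url after all redirects have been followed
print r.history    # list of the intermediate responses, e.g. [<Response [301]>]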

But a page can also be redirected by JavaScript, and no module will follow that kind of redirection.

Turn off JavaScript in your browser and go to http://thetechshowdown.com/Redirect4.php to see whether it still redirects you to the other page.

I checked this page - there is a JavaScript redirect and an HTML redirect (a <meta> tag with a "refresh" argument). Neither is a normal redirection sent by the server, so no module will follow it. You have to read the page, find the url in its code and connect to that url yourself.

import requests
import lxml, lxml.html

# starting page

r = requests.get('http://thetechshowdown.com/Redirect4.php')

#print r.url
#print r.history
#print r.text

# first redirection - look for a <meta http-equiv="refresh"> tag in the HTML

html = lxml.html.fromstring(r.text)

refresh = html.cssselect('meta[http-equiv="refresh"]')

if refresh:
    print 'refresh:', refresh[0].attrib['content']
    x = refresh[0].attrib['content'].find('http')
    url = refresh[0].attrib['content'][x:]
    print 'url:', url
    r = requests.get(url)

#print r.text

# second redirection - the same <meta refresh> trick on the intermediate page

html = lxml.html.fromstring(r.text)

refresh = html.cssselect('meta[http-equiv="refresh"]')

if refresh:
    print 'refresh:', refresh[0].attrib['content']
    x = refresh[0].attrib['content'].find('http')
    url = refresh[0].attrib['content'][x:]
    print 'url:', url
    r = requests.get(url)

# final page

print r.text
furas
  • I tried requests. Unfortunately it also did not follow the redirection, as you speculated... Moreover, the problem is not with the browser, as it follows the redirection (I'm using Firefox and that's how I got the url of the redirected page); the problem is that the Python module does not follow it. – user3821329 Jul 12 '14 at 20:01
  • Did you try my code? It follows the redirection for this example. Python follows only redirections sent by the server (using the `HTTP` protocol) - not JavaScript or HTML tags. I used the browser only to test what kind of redirection it could be. – furas Jul 12 '14 at 20:04
  • And there were two redirections from that url to `bhphotovideo.com` – furas Jul 12 '14 at 20:06
  • Thanks for your answer. But there is another issue. for some urls I get an error message in this line: html = lxml.html.fromstring(r.text) File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121) File "parser.pxi", line 1781, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102435) ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration. Any clues? – user3821329 Jul 13 '14 at 00:55
  • Show full error message and line. There is no ` from BeautifulSoup import BeautifulSoup` in code. – furas Jul 13 '14 at 00:56
  • Probably a problem with native letters, or the page uses a different string encoding than 'utf-8'. – furas Jul 13 '14 at 00:58
  • You can try `.fromstring( r.text.decode('UTF-8') )`. Or maybe it read XML page - there is some problem with `encoding declaration` in `XML`. Search on Google. – furas Jul 13 '14 at 01:07
  • Thanks for the answer. However, I still encounter some redirections that are not followed. For example this one: http://blekko.com/a/serpr?r=4WgZKGhYDCRRBim_2cosrjKEWOf2Fmz8PQlVyYT5iqcD875s_gIIdE5G97qsuVbtj8Ww5gQazWmBdBz-yuGsXgysUqkaMna2-8_3m98RhJ5pR4VlxPttPVBDOpHozJKVUP1zNw7qfiQJlUGQoVQvyrM-K6CTKT6LRPjOi-6cgJxOErFzNpLbcl9pPU6DLkNdy12hN84qqttbQlYXYc49SA&bt=&p=7 ///url finished here, it finally redirects to this link: http://www.zappos.com/mark-nason-boots – user3821329 Jul 17 '14 at 20:24
0

That happens because of soft redirects. urllib is not following the redirects because it does not recognize them as such. In fact, an HTTP response code 200 (page found) is issued, and the redirection happens as a side effect in the browser.

The first page returns an HTTP response code 200, but contains the following:

<meta http-equiv="refresh" content="1; url=http://fave.co/1idiTuz">

which instructs the browser to follow the link. The second resource issues an HTTP response code 301 or 302 (redirect) to another resource, where a second soft redirect takes place, this time with JavaScript:

<script type="text/javascript">
    setTimeout(function () {window.location.replace(\'http://bhphotovideo.com\');}, 2.75 * 1000);
</script>
<noscript>
    <meta http-equiv="refresh" content="2.75;http://bhphotovideo.com" />
</noscript>

Unfortunately, you will have to extract the URLs to follow by hand. However, it's not difficult. Here is the code:

from lxml.html import parse
from urllib import urlopen
from contextlib import closing

def follow(url):
    """Follow both true and soft redirects."""
    while True:
        with closing(urlopen(url)) as stream:
            next = parse(stream).xpath("//meta[@http-equiv = 'refresh']/@content")
            if next:
                url = next[0].split(";")[1].strip().replace("url=", "")
            else:
                return stream.geturl()

print follow("http://thetechshowdown.com/Redirect4.php")

I will leave the error handling to you :) Also note that this might result in an endless loop if the target page contains a <meta> refresh tag too. That is not your case, but you could add some sort of check to prevent it: stop after n redirects, see if the page is redirecting to itself, whichever you think is better.
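
For example, a hedged variant of the function above with both checks added (the limit of 10 is arbitrary, and the helper name follow_limited is mine, not part of any library):

from lxml.html import parse
from urllib import urlopen
from contextlib import closing

def follow_limited(url, max_redirects=10):
    """Like follow(), but stops on self-redirects and after max_redirects hops."""
    for _ in range(max_redirects):
        with closing(urlopen(url)) as stream:
            target = parse(stream).xpath("//meta[@http-equiv = 'refresh']/@content")
            if not target:
                return stream.geturl()          # no more soft redirects: we are done
            next_url = target[0].split(";")[1].strip().replace("url=", "")
            if next_url == url:                 # page refreshes itself: stop here
                return url
            url = next_url
    raise RuntimeError("too many soft redirects")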

You will probably need to install the lxml library.

Stefano Sanfilippo
  • Thanks for the answer. However, I encounter some redirections that still are not followed. For example this one: http://blekko.com/a/serpr?r=4WgZKGhYDCRRBim_2cosrjKEWOf2Fmz8PQlVyYT5iqcD875s_gIIdE5G97qsuVbtj8Ww5gQazWmBdBz-yuGsXgysUqkaMna2-8_3m98RhJ5pR4VlxPttPVBDOpHozJKVUP1zNw7qfiQJlUGQoVQvyrM-K6CTKT6LRPjOi-6cgJxOErFzNpLbcl9pPU6DLkNdy12hN84qqttbQlYXYc49SA&bt=&p=7 ///url finished here, it finally redirects to this link: http://www.zappos.com/mark-nason-boots – user3821329 Jul 17 '14 at 20:24
  • I forgot to consider that corner case. Try the edited version. – Stefano Sanfilippo Jul 18 '14 at 10:02
0

A meta refresh redirection url in the HTML can look like any of these:

Relative urls:

<meta http-equiv="refresh" content="0; url=legal_notices_en.htm#disclaimer">

With quotes inside quotes:

<meta http-equiv="refresh" content="0; url='legal_notices_en.htm#disclaimer'">

Uppercase letters in the content of the tag:

<meta http-equiv="refresh" content="0; URL=legal_notices_en.htm#disclaimer">
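
For illustration, a tiny helper (a sketch only; it mirrors the lower()/split()/strip() steps of the full code below) applied to the three variants above:

def extract_refresh_url(content):
    """Pull the target url out of a meta refresh content attribute."""
    # lower() makes 'URL=' match too (note it also lower-cases the url itself,
    # exactly like the full code below); the two splits isolate the url part
    url = content.lower().split('url=')[1].split(';')[0]
    # strip wrapping quotes and spaces
    return url.strip().strip('"').strip("'").strip()

print extract_refresh_url('0; url=legal_notices_en.htm#disclaimer')
print extract_refresh_url("0; url='legal_notices_en.htm#disclaimer'")
print extract_refresh_url('0; URL=legal_notices_en.htm#disclaimer')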

Summary:

  • Use lxml.html to parse the html,
  • Use a lower() and two split()s to get the url part,
  • Strip any wrapping quotes and spaces,
  • Get the absolute url,
  • Cache the results in a local file with shelve (useful if you have lots of urls to test).

Usage:

print get_redirections('https://www.google.com')

Returns something like:

{'final': u'https://www.google.be/?gfe_rd=fd&ei=FDDASaSADFASd', 'history': [<Response [302]>]}

Code:

from urlparse import urljoin
import shelve, requests
import lxml.html

def get_redirections(initial_url, url_id = None):
    if not url_id:
        url_id = initial_url
    documents_checked = shelve.open('tested_urls.log')
    if url_id in documents_checked:
        print 'cached'
        output = documents_checked[url_id]
    else:
        print 'not cached'
        redirecting = True
        history = []
        try:
            current_url = initial_url
            while redirecting:
                r = requests.get(current_url)
                final = r.url
                history += r.history
                status = {'final': final, 'history': history}

                # parse the body and look for a meta refresh tag
                html = lxml.html.fromstring(r.text.encode('utf8'))
                refresh = html.cssselect('meta[http-equiv="refresh"]')
                if refresh:
                    refresh_content = refresh[0].attrib['content']

                    # lower() matches both 'url=' and 'URL='; the two splits isolate the url
                    current_url = refresh_content.lower().split('url=')[1].split(';')[0]
                    before_stripping = ''
                    after_stripping = current_url

                    # strip wrapping quotes and spaces until nothing changes
                    while before_stripping != after_stripping:
                        before_stripping = after_stripping
                        after_stripping = before_stripping.strip('"').strip("'").strip()

                    # resolve relative urls against the page that was just fetched
                    current_url = urljoin(final, after_stripping)
                    history += [current_url]

                else:
                    redirecting = False

        except requests.exceptions.RequestException as e:
            status = {'final': str(e), 'history': [], 'error': e}

        documents_checked[url_id] = status
        output = status

    documents_checked.close()
    return output

url = 'http://google.com'
print get_redirections(url)
Ivan Chaer