
I am using Python 2.7.

I want to open the URL of a website and extract information from it. The information I am looking for is on the US version of the website (http://www.thewebsite.com). Since I am based in Canada, I am automatically redirected to the Canadian version (http://ca.thewebsite.com). I am looking for a way to avoid this redirect.

If I take any browser (IE, Firefox, Chrome, ...) and navigate to http://www.thewebsite.com, I get redirected. The website offers a menu where the visitor can pick the "country version" of the website they want to view. Once I select United States, I am no longer redirected to the Canadian version. This holds for any new tab within the same browsing session, so I suspect the choice is stored in a cookie.
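If the country choice really is stored in a cookie, one thing to try is sending that cookie with the request. The following is only a sketch under that assumption: the cookie name 'country' and the value 'US' are hypothetical placeholders, and the real pair would have to be read from the browser's developer tools after selecting United States. The helper build_request is likewise just an illustrative name.

```python
try:  # Python 2.7, as used in the question
    from urllib2 import Request, urlopen
except ImportError:  # Python 3 equivalent
    from urllib.request import Request, urlopen


def build_request(url, cookie_name, cookie_value):
    """Return a Request that carries a single preset cookie."""
    request = Request(url)
    request.add_header('Cookie', '%s=%s' % (cookie_name, cookie_value))
    return request


# 'country' and 'US' are hypothetical; substitute the cookie the site actually sets.
request = build_request('http://www.thewebsite.com', 'country', 'US')

# Network call, left commented out so the sketch can be read and run offline:
# html = urlopen(request).read()
```

Whether this prevents the redirect depends on how the site decides which version to serve; if it relies on the visitor's IP address alone, a preset cookie will not help.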

I tried to use the following code to prevent the redirect:

import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    # Return the 3xx response itself instead of following its Location header.
    def http_error_302(self, req, fp, code, msg, headers):
        result = urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result

    # Apply the same behaviour to the other redirect status codes.
    http_error_301 = http_error_303 = http_error_307 = http_error_302

opener = urllib2.build_opener(RedirectHandler())
webpage = opener.open('http://www.thewebsite.com')

but it didn't seem to work since the only bit of code that can be extracted afterwards is:

<html><head></head><body>‹</body></html>
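A near-empty body like this is what one would expect if the server answers with a 3xx status and puts the target only in the Location header. A small probe can confirm that by declining the redirect and reporting the status code and Location value. This is a sketch, not a fix: NoRedirectHandler and probe_redirect are illustrative names, and the import shim is only there so the same code runs on Python 2.7 and 3.

```python
try:  # Python 2.7, as used in the question
    from urllib2 import HTTPRedirectHandler, HTTPError, build_opener
except ImportError:  # Python 3 equivalents
    from urllib.request import HTTPRedirectHandler, build_opener
    from urllib.error import HTTPError


class NoRedirectHandler(HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None declines the redirect; the opener then raises HTTPError.
        return None


def probe_redirect(url):
    """Return (status_code, location) for url without following redirects."""
    opener = build_opener(NoRedirectHandler())
    try:
        response = opener.open(url)
    except HTTPError as error:
        # For a declined 3xx, Location holds the URL the server wanted to send us to.
        return error.code, error.headers.get('Location')
    return response.getcode(), None


# Network call, left commented out so the sketch can be read and run offline:
# print(probe_redirect('http://www.thewebsite.com'))
```

If the probe reports a 200 status and no Location, the redirect is probably not an HTTP redirect at all (for example a meta refresh or a script), and a redirect handler will have no effect on it.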

One solution would be to use a proxy while scraping the website, but I was wondering whether there is a way to prevent this kind of redirect using only Python or Python packages.

LaGuille

1 Answer


I would use mechanize, http://wwwsearch.sourceforge.net/mechanize/

And you can use

# Don't handle Refresh redirections
br.set_handle_refresh(False)

Here 'br' is the mechanize Browser object used to open the page. Mechanize also has proxy support.

Grady D