I'm learning to write web scrapers and want to scrape TripAdvisor for a personal project, fetching the HTML with urllib2. However, with the code below, the HTML I get back is not what I expect: the page seems to take a second to redirect (you can verify this by visiting the URL), and I get the HTML of the page that briefly appears first instead.

Is there some behavior or parameter I can set to make sure the page has completely finished loading/redirecting before I fetch its content?

import urllib2
from bs4 import BeautifulSoup

bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage)
print soup.prettify()

Edit: The answer is thorough; however, in the end what solved my problem was this: https://stackoverflow.com/a/3210737/1157283

Ken
  • Doesn't urllib raise an error? There is a redirectdirector for such cases... – Don Question Jul 12 '12 at 20:50
  • @DonQuestion No error, I just get the html from the page that briefly appears before being redirected. I want the html from the page that appears in the end. What is this redirectdirector, can you elaborate? – Ken Jul 12 '12 at 20:55
  • If you're using urlopen, you are using OpenerDirector.open(). Look at the Python docs; unfortunately it's not explained in 2-3 words :-( : http://docs.python.org/library/urllib2.html?highlight=urllib2#urllib2.OpenerDirector – Don Question Jul 12 '12 at 21:08

1 Answer

Interesting. The problem isn't a redirect: the page modifies its content using JavaScript, but urllib2 doesn't have a JS engine, it just GETs data. If you disable JavaScript in your browser, you will see that it loads basically the same content that urllib2 returns.

import urllib2
from BeautifulSoup import BeautifulSoup

bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage)
# a BeautifulSoup object has no read(); write the parsed markup instead
open('test.html', 'w').write(soup.prettify())

Opening test.html and loading the page with JavaScript disabled in your browser (easiest in Firefox: Content -> uncheck Enable JavaScript) give identical results.
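To make the point concrete without hitting the network, here is a small Python 3 sketch using only the standard library's html.parser. The sample markup and the names SAMPLE_HTML and StaticView are made up for illustration and are not taken from TripAdvisor. It shows that a plain parser sees the placeholder text and the script source, but never the content the script would insert in a browser.

```python
from html.parser import HTMLParser

# Hypothetical page: the real content is filled in by JavaScript at load time.
SAMPLE_HTML = """<html><body>
<div id="results">Loading...</div>
<script>
document.getElementById("results").textContent = "Hotel A, Hotel B";
</script>
</body></html>"""


class StaticView(HTMLParser):
    """Collects what a non-JS client sees: visible text and raw script source."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.visible_text = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            # The script is just text to the parser; nothing executes it.
            self.scripts.append(data)
        elif data.strip():
            self.visible_text.append(data.strip())


view = StaticView()
view.feed(SAMPLE_HTML)
view.close()

print(view.visible_text)  # ['Loading...']
```

The hotel names only exist inside the script source, which is the same situation as the HTML urllib2 returns for a JavaScript-driven page.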

So what can we do? Well, first we should check whether the site offers an API, since scraping tends to be frowned upon: http://www.tripadvisor.com/help/what_type_of_tripadvisor_content_is_available

Travel/hotel APIs? It looks like they might offer them, though with some restrictions.

But if we still need to scrape it with JS enabled, we can use Selenium (http://seleniumhq.org/). It's mainly used for testing, but it's easy to use and has fairly good docs.
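A rough Python 3 sketch of the Selenium approach follows, assuming the selenium package and a Firefox driver are installed. The helper names fetch_rendered_html and fetch_with_firefox are hypothetical. The helper uses only the WebDriver members get, execute_script, page_source and quit. Note that document.readyState reaching "complete" only means the initial load finished; content a script injects later may need a more specific wait.

```python
import time


def fetch_rendered_html(driver, url, timeout=10.0, poll=0.25):
    """Load url in a browser driver and return the DOM after scripts have run.

    `driver` is any object exposing the Selenium WebDriver members used below.
    """
    try:
        driver.get(url)
        deadline = time.time() + timeout
        # Poll until the browser reports the document as fully loaded,
        # or give up after `timeout` seconds and return what we have.
        while time.time() < deadline:
            if driver.execute_script("return document.readyState") == "complete":
                break
            time.sleep(poll)
        return driver.page_source
    finally:
        # Always close the browser, even if loading failed.
        driver.quit()


def fetch_with_firefox(url):
    # Imported lazily so the helper above can be tried without Selenium installed.
    from selenium import webdriver
    return fetch_rendered_html(webdriver.Firefox(), url)
```

Calling fetch_with_firefox with the search URL should return the HTML as the browser sees it after JavaScript has run, which can then be passed to BeautifulSoup as before.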

I also found this question, "Scraping websites with Javascript enabled?", and this: http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/

hope that helps.

As a side note:

>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> 
>>> bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
>>> value = bostonPage.read()
>>> soup = BeautifulSoup(value)
>>> open('test.html', 'w').write(value)
Samy Vilar
  • Thanks for your answer. Let me try to reiterate some of that: so when you click on the different categories like "Luxury" or "Families", the changes you see on the page are generated solely through javascript? (ie the code for the page never changes?) And what I need to do is find a tool that will run the JS and then return that content? What is easiest/the best from what you recommended? I feel an api is not appropriate for what I'm trying to do in this case. – Ken Jul 12 '12 at 21:16
  • Selenium may be the best way to do this. It drives an actual browser, fully automated, so it needs a browser installed with at least a virtual framebuffer or a desktop environment, since it will launch one ... – Samy Vilar Jul 12 '12 at 21:26