I use spynner for scraping data from a site. My code is this:

import spynner

br = spynner.Browser()
br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews")
text = br._get_html()

This code fails to load the entire html page. This is the html that I received:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head>

<script type="text/javascript">(function(){var d=document,m=d.cookie.match(/_abs=(([or])[a-z]*)/i)
v_abs=m?m[1].toUpperCase():'N'
if(m){d.cookie='_abs='+v_abs+'; path=/; domain=.venere.com';if(m[2]=='r')location.reload(true)}
v_abp='--OO--OOO-OO-O'
v_abu=[,,1,1,,,1,1,1,,1,1,,1]})()

My question is: how do I load the complete html?

More information:

I tried with:

import spynner
br = spynner.Browser()
respond = br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews")

if respond is None:
    br.wait_load()

but the HTML never loads completely or reliably. What is the problem? I'm going crazy.

Also: I'm running this inside Django 1.3. If I run the same code in plain Python (2.7), it sometimes loads all the HTML.

RoverDar
  • have you tried br.wait_load() ? – andrean Oct 16 '12 at 14:09
  • OK, I used the same code you use, with br.wait_load(5), and got the full page back. – andrean Oct 16 '12 at 14:19
  • That's true, but it doesn't load the full page. If you view the HTML page in Chrome, you can find this code:

    "review text"

    With wait_load(5), the element with id="feedback-1052368" doesn't load.
    – RoverDar Oct 16 '12 at 14:34
  • sorry I could not find any id starting with "feedback" in chrome, are you sure this is the right url? – andrean Oct 16 '12 at 15:52
  • Yes, I'm sure. If you select the review text and use "Inspect Element", you can find this:

    This guest didn't leave us a comment.

    – RoverDar Oct 16 '12 at 15:59

1 Answer

Run the code below, then check the contents of test.html: you will find the p elements with id="feedback-...somenumber...":

import spynner

def content_ready(browser):
    # Passed as wait_callback; return True once the review markup is present.
    return 'id="feedback-' in browser.html

br = spynner.Browser()
br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews", wait_callback=content_ready)

# Save the rendered page so it can be inspected.
with open("test.html", "w") as hf:
    hf.write(br.html.encode("utf-8"))
andrean
  • Unfortunately, sometimes br.load() doesn't load the review text or the 'id-feedback-' elements. It's impossible to write a general function (I have many URLs...). The only solution for now is to use wait_load(10)... but it's very slow. Thank you, andrean, for your patience. – RoverDar Oct 17 '12 at 07:45
  • Is it possible to use 'wait_callback' to know when all the content has loaded? – RoverDar Oct 17 '12 at 12:52
  • Well, wait_callback does nothing in the background except run the test we write inside the function body. We are supposed to write a function that returns True once we are satisfied with the page result, so that test function should contain the check(s) that determine whether the page is loaded. You could, for example, search for multiple elements in the page body and return True only if all of them are found, meaning the page is loaded. If it doesn't return True, spynner will call it again and again until the checks return True. – andrean Oct 17 '12 at 13:46
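
The multi-element check described in the last comment could be sketched roughly as follows. This is only an illustration, not code from the thread: `make_content_ready` and `FakeBrowser` are hypothetical names, the marker `id="reviews"` is an assumed example, and the only thing assumed about the browser object is that it exposes an `html` attribute, as in the answer above. The fake browser is there so the callback can be tried without loading a real page.

```python
def make_content_ready(markers):
    """Build a callback that returns True only when every marker is present.

    `markers` is a hypothetical list of substrings expected in the final HTML.
    """
    def content_ready(browser):
        # Treat a missing/empty page as "not ready yet".
        html = browser.html or ""
        return all(marker in html for marker in markers)
    return content_ready


class FakeBrowser(object):
    """Stand-in exposing only an `html` attribute, for trying the callback offline."""
    def __init__(self, html):
        self.html = html


# 'id="reviews"' is an assumed marker chosen for illustration only.
ready = make_content_ready(['id="feedback-', 'id="reviews"'])

partial = FakeBrowser('<div id="reviews"></div>')
full = FakeBrowser('<div id="reviews"><p id="feedback-1052368">text</p></div>')

print(ready(partial))  # one marker is missing
print(ready(full))     # both markers are present
```

With a real browser, the returned function would be passed the same way as in the answer, e.g. `br.load(url, wait_callback=ready)`. Choosing markers that only appear once the reviews have been injected is what makes the check meaningful; the right markers depend on each page.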