
I'm trying to collect data from the SumofUs website, specifically the number of signatures on a petition. The value is presented like this: <div class="percent">256,485 </div> (this is the only element of this class on the page).

So I tried this:

import requests
from bs4 import BeautifulSoup

user_agent = {'User-agent': 'Mozilla/5.0'}
url = 'http://action.sumofus.org/a/nhs-patient-corporations/'

raw = requests.get(url, headers=user_agent)
html = BeautifulSoup(raw.text, "html.parser")

# get the item we're seeking
number = html.find("div", class_="percent")
print(number)

It seems that the number isn't rendered (I've tried a couple of user agent strings). What else could be causing this? How can I work around this in future?

mediaczar
  • Have you considered the possibility that this string is rendered by JavaScript once the page has been loaded? – Joel Cornett Mar 04 '14 at 16:56
  • The string is indeed rendered by JavaScript after the page has loaded, so the best option for scraping it may be to fetch the page, run its JavaScript in a headless browser, and scrape the result. Options for doing that are given here: http://stackoverflow.com/questions/16375251/evaluate-javascript-on-a-local-html-file-without-browser. – Hugh McGrade Mar 04 '14 at 17:02

2 Answers


You could use Selenium:

import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://action.sumofus.org/a/nhs-patient-corporations/'
driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # wait for the JavaScript to populate the page

# then load BeautifulSoup with the browser's rendered content
html = BeautifulSoup(driver.page_source, "html.parser")
...
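
Once the rendered page is in BeautifulSoup, the text of the div still has to be turned into a number. The question shows it displayed as `256,485 ` (thousands separator plus a trailing space), which plain `int()` would reject. A minimal sketch of a converter follows; the name `parse_signature_count` is hypothetical, and it assumes the count is always a whole number.

```python
import re


def parse_signature_count(text):
    """Convert display text such as '256,485 ' to an int (hypothetical helper)."""
    # Keep only the digits; this drops the comma and any surrounding whitespace.
    # Assumption: the count is a whole number with no decimals or suffixes.
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError("no digits in {!r}".format(text))
    return int(digits)


print(parse_signature_count("256,485 "))  # 256485
```

With the `html` object above, this would typically be fed the result of `html.find("div", class_="percent").get_text()`.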
PepperoniPizza
  • While this is indeed cool and taught me something new, I _think_ that it lacks the general applicability of the other response. Totally going to try it though. – mediaczar Mar 04 '14 at 23:17

In the general case you should use a headless browser. Ghost.py is written in Python, so it's probably a good choice to try first.

In this specific case, a little research reveals a much simpler method. The Network tab in Chrome's developer tools shows that the site makes an AJAX call to populate the value, so you can just request it directly:

import requests

url = "http://action.sumofus.org/api/ak_action_count_by_action/?action=nhs-patient-corporations&additional="
number = int(requests.get(url).text)
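
To reuse this for other petitions, the URL can be built from the petition's slug rather than hard-coded. The sketch below uses only the standard library; `build_count_url` is a hypothetical helper, and it assumes other petitions are served by the same endpoint with the same two query parameters, which has only been observed for this one page.

```python
from urllib.parse import urlencode

# Endpoint taken from the request observed in the browser's Network tab.
API_BASE = "http://action.sumofus.org/api/ak_action_count_by_action/"


def build_count_url(action_slug):
    """Return the count URL for a petition slug (hypothetical helper)."""
    # Assumption: every petition uses the same 'action' and 'additional'
    # parameters as the one request seen for nhs-patient-corporations.
    query = urlencode({"action": action_slug, "additional": ""})
    return API_BASE + "?" + query


print(build_count_url("nhs-patient-corporations"))
```

For the slug in the question this reproduces the URL given above, so the result can be passed straight to `requests.get`.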
Reite