I'm trying to scrape data from a web page using the Python Windmill framework, but I'm having trouble getting the content of an HTML table. The table is generated by JavaScript, which is why I'm using Windmill to grab the content. However, the content Windmill returns doesn't include the table, which causes errors when I try to parse it with BeautifulSoup.

from windmill.authoring import WindmillTestClient
from BeautifulSoup import BeautifulSoup
from copy import copy
import re


def get_massage():
    # Start from BeautifulSoup's default markup fixes and add two of my own.
    my_massage = copy(BeautifulSoup.MARKUP_MASSAGE)
    # Strip inline document.write(...) calls.
    my_massage.append((re.compile(u"document.write(.+);"), lambda match: ""))
    # Replace alt="..."> with a bare '>'.
    my_massage.append((re.compile(u'alt=".+">'), lambda match: ">"))
    return my_massage


def test_scrape():
    my_massage = get_massage()
    client = WindmillTestClient(__name__)
    client.open(url='http://marinetraffic.com/ais/datasheet.aspx?MMSI=636092060&TIMESTAMP=2&menuid=&datasource=POS&app=&mode=&B1=Search')
    client.waits.forPageLoad(timeout='60000')
    # Fetch the page content once the page has loaded.
    html = client.commands.getPageText()
    assert html['status']
    assert html['result']
    soup = BeautifulSoup(html['result'], markupMassage=my_massage)
    print soup.prettify()

When I look at the output from the soup, the table is missing, yet it is displayed when I inspect the page with something like Firebug. Ultimately I'm trying to grab the table content and parse it into some kind of data structure for further processing. Any help is much appreciated!
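
To make the goal concrete, here is a rough sketch of the kind of structure I'm after, assuming the table's HTML were actually available. The sample markup, the column names, and the names `TableParser` and `rows_to_dicts` are all made up for illustration. It uses the standard-library HTML parser rather than BeautifulSoup purely so that it runs on its own.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def rows_to_dicts(rows):
    """Treat the first row as headers and map each later row onto them."""
    headers = rows[0]
    return [dict(zip(headers, row)) for row in rows[1:]]


# Made-up sample markup; the real page's columns may well differ.
SAMPLE_HTML = """
<table>
  <tr><th>MMSI</th><th>Speed</th><th>Timestamp</th></tr>
  <tr><td>636092060</td><td>12.3</td><td>2011-01-01 10:00</td></tr>
  <tr><td>636092060</td><td>12.1</td><td>2011-01-01 10:05</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
records = rows_to_dicts(parser.rows)
```

Here `records` ends up as a list with one dictionary per data row, keyed by the header cells. Something along those lines would be fine for my purposes, as long as I can get hold of the table markup in the first place.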