Screen scraping dynamic webpage in python with Ghost.py

Question

ghost = Ghost()
page, rcs = ghost.open(https://soundcloud.com/passionpit/sets/favorites)
page, rcs = ghost.wait_for_page_loaded()
songs = ghost.evaluate("document.getElementsByClassName('soundTitle__title');")
print songs

I am attempting to use the above code to find all html elements on the above page that have the class 'soundTitle__title' however as of right now my output is

QFont::setPixelSize: Pixel size <= 0 (0)
({PyQt4.QtCore.QString(u'length'): 0.0}, [])

Can anyone help me see where my problem is? When I run document.getElementsByClassName('soundTitle__title') in my browsers console I get the output I expect, why is the Python output different?

Or is there some way for me to use Ghost.py or another similar library to get the source of the page after the JavaScript has run (the source seen when inspecting an element with browser developer tools)?

I'm sorry, but I just have to do this with lxml.html. `from lxml import html` `html.parse("https://soundcloud.com/passionpit/sets/favorites").getroot().cssselect(".soundTitle__title")` — ToonAlfrink, Jun 24 '14 at 22:12
I tried running your code and ran into some issues. My output is `IOError: Error reading file 'https://soundcloud.com/passionpit/sets/favorites': failed to load external entity "https://soundcloud.com/passionpit/sets/favorites"` — walshie4, Jun 24 '14 at 22:20
There is [SoundCloud API](https://developers.soundcloud.com/) - something for developers. — furas, Jun 25 '14 at 01:35
There is even [Python wrapper around the SoundCloud API](https://github.com/soundcloud/soundcloud-python) — furas, Jun 25 '14 at 01:37
I suggest you call the undocumented `show()` (https://github.com/jeanphix/Ghost.py/blob/master/ghost/ghost.py#L886) which will open the QWebView Ghost is using (it's not really headless). This might provide some clues. Also note `evaluate()` returns a tuple of `result, resources`. — Ben Graham, Jun 25 '14 at 15:09
It shows what I am looking for, however calling `g.content` still gives me the un-executed source of the page. — walshie4, Jun 26 '14 at 04:54

score 4 · Accepted Answer · answered Jun 29 '14 at 15:03

I got this working, and would recommend, using Splinter, which is basically just running phantomjs and selenium under the hood.

You'll need to run pip install splinter and also install phantomjs on your machine, either by downloading/untarring or npm -g install phantomjs if you have npm, etc. But overall the installation and dependencies are minimal and straightforward.

The following code returns 'Ryn Weaver - OctaHate', which I'm assuming is what you're looking for, although without more context I can't be totally sure.

from splinter import Browser

browser = Browser('phantomjs')
browser.visit('https://soundcloud.com/passionpit/sets/favorites')
songs = browser.find_by_xpath("//a[contains(@class, 'soundTitle__title')]")
if songs:
    for song in songs:
        print song.text
else:
    print "there aren't any songs"

You'll also notice that I had to do an xpath-contains to get the class description that you were looking for; so, you might be running into a problem when trying to access that class by the notation you were using - there is a span element and also an anchor element that both contain 'soundTitle__title' but as far as I could tell, only the 'a' element had text and I would guess that's what you're looking for. But if you want both you could just do browser.find_by_xpath("//*[contains(@class, 'soundTitle__title')]")

Screen scraping dynamic webpage in python with Ghost.py

1 Answers1