Morning, far-smarter-than-me people. I'm having an odd issue web-scraping Mashable.com that I hope someone can shed some light on.
Mashable's search page populates the results from a block looking something like...
<script>
window.__bootstrap = {"posts":[{"_id":"54b687d512d2cd49040027dd","id":"2015/01/14/bitcoin-price-200","title":"Bitcoin prices collapse below $200 for first time since 2013","title_tag":null,"author":"Seth Fiegerman","post_date":"2015-01-14T15:14:19+00:00","post_date_rfc":"Wed, 14 Jan 2015 15:14:19 +0000","sort_key":"1ybqcU","link":"http://mashable.com/2015/01/14/bitcoin-price-200/","content":{"plain":"Bitcoin prices are collapsing almost as quickly as they originally skyrocketed.
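For context, this is roughly what I'd like to do once I have the page source in hand. It's only a minimal sketch: the `html` string is a made-up, shortened stand-in for the real page, it assumes the whole `window.__bootstrap` object sits on a single line, and it assumes the `jsonlite` package is installed.

```r
library(jsonlite)  # assumed to be installed; only fromJSON() is used

# Hypothetical, shortened stand-in for the real page source
html <- '<script>window.__bootstrap = {"posts":[{"title":"Bitcoin prices collapse","author":"Seth Fiegerman"}]};</script>'

# Grab the object literal assigned to window.__bootstrap.
# The greedy ".*" runs to the last "}" on the line, so this relies on
# the JSON being on one line with nothing brace-like after it.
pattern <- "window\\.__bootstrap\\s*=\\s*\\{.*\\}"
match <- regmatches(html, regexpr(pattern, html, perl = TRUE))
stopifnot(length(match) == 1)  # fail loudly if the block is missing

# Strip the assignment prefix so only the JSON text remains
json <- sub("^window\\.__bootstrap\\s*=\\s*", "", match, perl = TRUE)

# Parse the JSON; the "posts" array of objects becomes a data frame
bootstrap <- fromJSON(json)
bootstrap$posts$title
```

A regex is a blunt tool for this, but it avoids needing a JavaScript engine just to read one assignment.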
My usual technique for overcoming such post-render issues is to use Selenium to grab the page; however, today things are not going to plan.
With the URL http://mashable.com/search/?t=stories&q=bitcoin&page=2 loaded through Selenium:
remoteSelenium$navigate(uri) # send selenium to page
html <- unlist(remoteSelenium$getPageSource()) # read in page contents
I get...
> html
applicationCacheEnabled rotatable handlesAlerts databaseEnabled version
"TRUE" "FALSE" "TRUE" "TRUE" "34.0.5"
platform nativeEvents acceptSslCerts webdriver.remote.sessionid webStorageEnabled
"MAC" "FALSE" "TRUE" "ed06539a-59dc-41a5-ba4e-07b2ed9a9490" "TRUE"
locationContextEnabled browserName takesScreenshot javascriptEnabled cssSelectorsEnabled
"TRUE" "firefox" "TRUE" "TRUE" "TRUE"
... rather than the page source itself. I can't fathom why this happens or how to resolve it, as the same approach works fine everywhere else I've tried it. Any thoughts or pointers to other questions/answers?