
Morning, far-smarter-than-me people. I'm having some odd issues web-scraping Mashable.com, which I hope someone can shed some light on.

Mashable's search page populates the results from a block looking something like...

    <script>
      window.__bootstrap = {"posts":[{"_id":"54b687d512d2cd49040027dd","id":"2015/01/14/bitcoin-price-200","title":"Bitcoin prices collapse below $200 for first time since 2013","title_tag":null,"author":"Seth Fiegerman","post_date":"2015-01-14T15:14:19+00:00","post_date_rfc":"Wed, 14 Jan 2015 15:14:19 +0000","sort_key":"1ybqcU","link":"http://mashable.com/2015/01/14/bitcoin-price-200/","content":{"plain":"Bitcoin prices are collapsing almost as quickly as they originally skyrocketed.
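
For context, this is roughly what I'd want to do with that block once I have the raw page source as a string. It's only a sketch: the `html` value is a shortened, made-up stand-in for the real page, I'm assuming the object sits on one line and is closed by `};`, and `jsonlite` is just my choice of JSON parser.

```r
library(jsonlite)

# Hypothetical stand-in for the page source; the real block is far longer.
html <- '<script>\n  window.__bootstrap = {"posts":[{"title":"Bitcoin prices collapse below $200 for first time since 2013","author":"Seth Fiegerman"}]};\n</script>'

# Capture the object literal between "window.__bootstrap = " and the closing "};".
# Assumption: the JSON is on a single line, as it appears in the snippet above.
pattern <- "window\\.__bootstrap\\s*=\\s*(\\{.*\\});"
m <- regmatches(html, regexec(pattern, html, perl = TRUE))

# m[[1]][1] is the full match, m[[1]][2] is the captured JSON text.
bootstrap <- fromJSON(m[[1]][2], simplifyVector = FALSE)

# Pull one field out of every post.
titles <- vapply(bootstrap$posts, function(p) p$title, character(1))
print(titles)
```

With `simplifyVector = FALSE` the posts stay as a plain list of lists, which I find easier to walk than the data frame `fromJSON` builds by default.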

My usual technique for overcoming such post-render issues is to use Selenium to grab the page; today, however, things are not going to plan.

With the URL http://mashable.com/search/?t=stories&q=bitcoin&page=2 loaded through Selenium:

    remoteSelenium$navigate(uri) # send selenium to page
    html <- unlist(remoteSelenium$getPageSource()) # read in page contents

I get...

    > html

               applicationCacheEnabled                              rotatable                          handlesAlerts                        databaseEnabled                                version 
                                "TRUE"                                "FALSE"                                 "TRUE"                                 "TRUE"                               "34.0.5" 
                              platform                           nativeEvents                         acceptSslCerts             webdriver.remote.sessionid                      webStorageEnabled 
                                 "MAC"                                "FALSE"                                 "TRUE" "ed06539a-59dc-41a5-ba4e-07b2ed9a9490"                                 "TRUE" 
                locationContextEnabled                            browserName                        takesScreenshot                      javascriptEnabled                    cssSelectorsEnabled 
                                "TRUE"                              "firefox"                                 "TRUE"                                 "TRUE"                                 "TRUE"

... rather than the page source itself. I can't fathom why this happens or how to resolve it, as the same approach works fine everywhere else I've tried it. Any thoughts or pointers to other questions/answers?

BarneyC
  • Thank you for reporting this. There was an issue with the mapping of unicode characters. Please try installing the latest dev version: `devtools::install_github("ropensci/RSelenium")` and retry your code. – jdharrison Jan 15 '15 at 14:37
  • Awesome John. That fixed it perfectly. Not sure what Mashable had done to their site over Christmas but with the dev RSelenium it scrapes again. Yay! – BarneyC Jan 15 '15 at 15:36
