I'm scraping a number of webpages and noticed that rvest (read_html, then html_text) and RSelenium (getPageSource()) return different results.

More specifically, when dropdown menus are involved, html_text gives only the names of the choices, while RSelenium also gives the URL of the page each choice leads to.

My questions are: (1) why do the results differ, and what exactly is the nature of the difference? and (2) is there a way to get the same source text that RSelenium extracts, but with a faster tool such as the rvest package?

I have tried webdriver, a PhantomJS implementation, as suggested in "rvest vs RSelenium results for text extracting", and its getSource function does return the same result as RSelenium. However, while it is faster than RSelenium, it is still much slower than rvest.

library(rvest)
library(RSelenium)
library(webdriver)
library(tictoc)
library(robotstxt)

test_url <- "https://www.bea.gov"
robotstxt::paths_allowed(test_url)

# rvest
tictoc::tic()
resultA <- html_text(read_html(test_url))
tictoc::toc()

# RSelenium
tictoc::tic()
remDr <- remoteDriver(port = 4445L, browserName = "firefox")
remDr$open()

remDr$navigate(test_url)
resultB <- remDr$getPageSource()[[1]]  # returns a list; the source is its first element
tictoc::toc()

# webdriver
tictoc::tic()
pjs <- run_phantomjs()
ses <- Session$new(port = pjs$port)

ses$go(test_url)
resultC <- ses$getSource()
tictoc::toc()

You can see that resultA differs from resultB and resultC. My focus is the part from the word "Tools" onwards, which is the dropdown menu for choosing among the tools that this website provides.

Showing just a small chunk, the "BEARFACTS" entry in the rvest result is:

BEARFACTS\n                                    \n                                                \n                                    

while in the RSelenium result it is something like the following:

<li class=\"expanded dropdown\">\n                    <a href=\"https://apps.bea.gov/regional/bearfacts/\">BEARFACTS</a>\n  
Hong
  • Please always check whether scraping a page is permitted by its owner. `robotstxt::paths_allowed(test_url)` yields `FALSE`, so you should not use it as an example. – Thomas K Aug 06 '19 at 07:32
  • Thanks for the heads up! It's a useful package that I will definitely use. I have changed the example accordingly. – Hong Aug 06 '19 at 20:22

1 Answer

The difference between RSelenium and rvest is:

  • RSelenium drives a real web browser, so it executes any JavaScript contained in the webpage (JavaScript is often used to load additional HTML elements or data after the initial HTML has loaded).
  • rvest does not run JavaScript, so it retrieves the page HTML faster but misses any elements that JavaScript adds after the initial page load.
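
Separately from JavaScript, html_text() returns only the text content of the nodes and drops tags and attributes, whereas getPageSource() returns the markup itself, so the two outputs are not directly comparable. When the link is present in the initial HTML, rvest can read it with html_attr(). The following is a minimal sketch on an inline snippet modelled on the markup shown in the question; the `li.dropdown a` selector is an assumption, and whether the live page serves this markup without JavaScript has not been checked here.

```r
library(rvest)

# Inline snippet modelled on the question's output (not fetched from the live site)
snippet <- paste0(
  '<ul><li class="expanded dropdown">',
  '<a href="https://apps.bea.gov/regional/bearfacts/">BEARFACTS</a>',
  '</li></ul>'
)

page  <- read_html(snippet)
links <- html_nodes(page, "li.dropdown a")  # assumed selector for the menu links

labels  <- html_text(links)          # visible label only; the href is dropped
targets <- html_attr(links, "href")  # the link target that html_text() omits
markup  <- as.character(links)       # serialized node, comparable to page source

print(data.frame(label = labels, href = targets))
```

If html_attr() returns nothing for the real page, the menu is probably built by JavaScript and a browser-based tool is still needed.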

Some useful tips:

  • When the content you need is in the initial HTML and does not depend on JavaScript, use rvest.
  • When you must use RSelenium, try a headless option to improve speed (the page loads in a browser as normal, but nothing is drawn on screen, so it is usually faster).

Example of using RSelenium headless:

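library(RSelenium)

# Chrome flags: run without a visible window
eCaps <- list(chromeOptions = list(
  args = c('--headless', '--disable-gpu', '--window-size=1280,800')
))

# chromever should match the Chrome version installed locally
rD <- rsDriver(browser = "chrome", verbose = TRUE, chromever = "78.0.3904.105",
               port = 4447L, extraCapabilities = eCaps)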
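
Once the headless browser has rendered the page, its source can be handed to rvest for parsing, which keeps the extraction code the same as for static pages. This is a rough sketch: `extract_links` is a hypothetical helper name, the URL is a placeholder, and the commented-out lines assume the `rD` object created by rsDriver() above.

```r
library(rvest)

# Hypothetical helper: parse page source (a single HTML string) and
# return the href of every <a> element
extract_links <- function(page_source) {
  doc <- read_html(page_source)
  html_attr(html_nodes(doc, "a"), "href")
}

# With a live session (assumes rD from rsDriver() above):
# remDr <- rD$client
# remDr$navigate("https://example.com")                 # placeholder URL
# links <- extract_links(remDr$getPageSource()[[1]])    # source is the first list element
# remDr$close()
# rD$server$stop()
```

The browser is then only used for rendering, and the selectors can be developed and tested on saved HTML without starting Selenium each time.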
stevec
  • Could you clarify how to do a headless option? I'm scraping a page for newspaper articles, for a thesis, and it would make sense that the final code didn't load the actual page. Thank you – Anders Jørgensen Apr 11 '21 at 11:22
  • @AndersJørgensen sure, I'll add it now – stevec Apr 11 '21 at 11:36
  • @AndersJørgensen if you have a specific url, let me know and I'll use it as the example – stevec Apr 11 '21 at 11:36
  • The newspaper's main page is: https://www.berlingske.dk/. – Anders Jørgensen Apr 11 '21 at 12:38
  • Some of the code for the project is done, and just needs to be put together in a function. I'm not as talented in coding as many in here, so I'll add everything in an answer below, and hope some kind soul has suggestions for how I can put it together to make it work. – Anders Jørgensen Apr 11 '21 at 12:44
  • https://stackoverflow.com/questions/67045717/how-is-rselenium-made-to-iterate-over-searchresults-and-retrieve-the-newspaperar – Anders Jørgensen Apr 11 '21 at 13:47
  • It would be helpful to see an example of a headless option that doesn't load and show the pages, @stevec. I don't want to sound pushy but it would really help me. I hope you have a wonderful day. Thank you – Anders Jørgensen Apr 12 '21 at 06:21
  • @AndersJørgensen I added a way (using chrome), but I see you're using firefox, so it would probably be a bit different – stevec Apr 12 '21 at 06:34
  • Ahh I didn't see your edit. Thank you anyway – Anders Jørgensen Apr 12 '21 at 06:37
  • @AndersJørgensen if a google search doesn't show how to do firefox headless, it's definitely a good question to ask as a separate question, feel free to link here – stevec Apr 12 '21 at 06:39