
I borrowed the following code from a previous Stack Overflow discussion (Extracting data from javascript with R). I'm basically trying to scrape data on some pharmaceuticals from a website. When I run the code for a single pharmaceutical code (2203), it works just fine!

appURL <- "http://web.sivicos.gov.co:8080/consultas/consultas/consreg_encabcum.jsp"
library(RSelenium)
pJS <- phantom(extras = c('--ssl-protocol=tlsv1'))
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
Sys.sleep(1) # give the binary a moment
remDr$navigate(appURL)
# Get the third list item of the select box (MEDICAMENTOS)
webElem <- remDr$findElement("css", "select[name='grupo'] option:nth-child(3)")
webElem$clickElement() # select this element
# Send text to input value="" name="expediente
webElem <- remDr$findElement("css", "input[name='expediente']")
webElem$sendKeysToElement(list(2203))
# Click the Buscar button
remDr$findElement("id", "INPUT2")$clickElement()
Sys.sleep(3) # give the binary a moment
remDr$switchToFrame(remDr$findElement("css", "iframe[name='datos']"))
remDr$findElement("css", "a")$clickElement() # click the link given in the iframe

# get the resulting data

appData <- remDr$getPageSource()[[1]]
# close phantom js
pJS$stop()

But when I put it inside a loop so that I can retrieve the information for all the pharmaceuticals I need, it breaks down. Below is the code inside the loop.

for(cum in 2203){
appURL <- "http://web.sivicos.gov.co:8080/consultas/consultas/consreg_encabcum.jsp"
library(RSelenium)
pJS <- phantom(extras = c('--ssl-protocol=tlsv1'))
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
Sys.sleep(1) # give the binary a moment
remDr$navigate(appURL)
# Get the third list item of the select box (MEDICAMENTOS)
webElem <- remDr$findElement("css", "select[name='grupo'] option:nth-child(3)")
webElem$clickElement() # select this element
# Send text to input value="" name="expediente
webElem <- remDr$findElement("css", "input[name='expediente']")
webElem$sendKeysToElement(list(cum))
# Click the Buscar button
remDr$findElement("id", "INPUT2")$clickElement()
Sys.sleep(3) # give the binary a moment
remDr$switchToFrame(remDr$findElement("css", "iframe[name='datos']"))
remDr$findElement("css", "a")$clickElement() # click the link given in the iframe

# get the resulting data

appData <- remDr$getPageSource()[[1]]
# close phantom js
pJS$stop()
readHTMLTable(appData, which = 3)

}

Any ideas on what's going on? I tried giving PhantomJS more time between steps, as I have heard that timing could be a problem, but it didn't help. The loop fails both with a single code (2203) and with two codes, c(2202, 2203).
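
On the timing point: a fixed Sys.sleep() can wait too long or not long enough. Below is a minimal sketch of a polling wait instead. It assumes a hypothetical helper called wait_for, which is not part of RSelenium, and retries a condition until it holds or a timeout passes.

```r
# Hypothetical helper (not part of RSelenium): poll `condition` until it
# returns TRUE or `timeout` seconds have elapsed.
# Returns TRUE on success and FALSE on timeout.
wait_for <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat an error inside the condition as "not ready yet"
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Possible use with an open session `remDr` (left commented out because it
# needs a running browser):
# wait_for(function() {
#   length(remDr$findElements("css", "iframe[name='datos']")) > 0
# })
```

The condition is passed as a function so that it is re-evaluated on every attempt. Whether waiting is the actual cause of the failure here is not established; this only removes the guesswork around sleep durations.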

1 Answer

It seems the website now checks the user agent, so set one through extraCapabilities:

appURL <- "http://web.sivicos.gov.co:8080/consultas/consultas/consreg_encabcum.jsp"
library(RSelenium)
library(XML) # provides readHTMLTable
pJS <- phantom(extras = c('--ssl-protocol=tlsv1'))
Sys.sleep(5) # give the binary a moment
for(cum in "2203"){
  eCap <- list(phantomjs.page.settings.userAgent 
               = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0")
  remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
  remDr$open()
  Sys.sleep(1) # give the binary a moment
  remDr$navigate(appURL)
  # Get the third list item of the select box (MEDICAMENTOS)
  webElem <- remDr$findElement("css", "select[name='grupo'] option:nth-child(3)")
  webElem$clickElement() # select this element
  # Send text to input value="" name="expediente
  webElem <- remDr$findElement("css", "input[name='expediente']")
  webElem$sendKeysToElement(list(cum))
  # Click the Buscar button
  remDr$findElement("id", "INPUT2")$clickElement()
  Sys.sleep(3) # give the binary a moment
  remDr$phantomExecute("var page = this;
                  page.switchToFrame('datos');
                  page.evaluate(function() {
                    document.querySelector('a').click();
                  });
  ")

  # get the resulting data
  Sys.sleep(3) # give the binary a moment

  appData <- remDr$getPageSource()[[1]]
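  # parse the third table; print() is needed because values are not auto-printed inside a for loop
  print(readHTMLTable(appData, which = 3))
  remDr$close() # close this browser session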
}
pJS$stop()
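
To keep the tables for several codes instead of only displaying them, one option is to wrap the per-code steps in a function and collect the results in a named list. The sketch below assumes two hypothetical names, scrape_all and scrape_one. Neither belongs to RSelenium, and scrape_one stands for whatever function runs the steps above for one code and returns the parsed table.

```r
# Hypothetical helper: call `scrape_one` for every code and collect the
# results in a named list. A code that fails is reported and stored as NULL,
# so one bad code does not abort the whole run.
scrape_all <- function(codes, scrape_one) {
  results <- vector("list", length(codes))
  names(results) <- as.character(codes)
  for (i in seq_along(codes)) {
    # Assign through list() so that a NULL result keeps its slot
    results[i] <- list(tryCatch(
      scrape_one(codes[[i]]),
      error = function(e) {
        message("Failed for code ", codes[[i]], ": ", conditionMessage(e))
        NULL
      }
    ))
  }
  results
}
```

With a function such as `scrape_one <- function(cum) { ...; readHTMLTable(appData, which = 3) }`, calling `scrape_all(c("2202", "2203"), scrape_one)` would return a list with one entry per code.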
jdharrison
  • Thanks for creating the package; it has been a lifesaver! I literally copied and pasted your code, but it doesn't work. I'm not sure if it's because I need to set the user agent to something specific. I'm going through your webinar http://cran.r-project.org/web/packages/RSelenium/vignettes/OCRUG-webinar.html right now, and hopefully I'll learn what I'm missing. – Mauricio Romero Apr 16 '15 at 15:20
  • @MauricioRomero there appear to be issues with PhantomJS 2 and WebDriver executing JavaScript in an iframe. I have given a workaround using `phantomExecute`. However, at this point PhantomJS 2 appears slightly buggy in WebDriver mode. You can switch back to PhantomJS 1.9.8 or use a different browser (Chrome/Firefox). – jdharrison Apr 17 '15 at 04:30