Thanks for taking an interest in this.

I was given the [tedious] task of looking up the country of origin of some medicines, as registered with the Colombian food and drug administration. The agency's website is dynamic (its pages have a .jsp extension), and I would like to know whether the lookup can be automated. These are the steps, one by one:

  1. Go to the agency's website: Agency's consult site
  2. Select "Medicamentos" in the drop-down list on the left
  3. Under "expediente" (rightmost box at the top), type the number we're looking for (two of the 900+ I have to check are 2203 and 3519). The radio-button selection does not matter.
  4. Hit the search button ("buscar")
  5. Click the link presented in the table below
  6. Ideally, get the table line that starts with FABRICANTE (manufacturer), but being able to save the document would be enough (I plan to get/clean/analyze the data using R later on).
  7. Hit the clear button ("nueva consulta")
  8. Repeat steps 3 to 7 for the next number.

I don't have the slightest idea whether this can be accomplished, and if so, how; so I'd appreciate any guidance that allows me to start in any direction (other than the one I have at hand now: looking them up by hand!). I'm familiar with R and some VB, but if it's possible in any other language, I'll give it a try.

What I've tried:

  • I tried to find information on extracting data from JavaScript, but most of what I found is about using JavaScript to pass data from different sorts of databases into HTML/XML, or about extracting the data from a single response. That's not the part I want to automate: once I'm at the response, it would be easy to look at just the value I need (country of origin). The "consult" part is the hardest! I've felt so off-track that I think I'm clueless as to how to search adequately. Guidance, ideas and starting points are much appreciated.
  • I opened the agency's site with the inspector (Firefox), but stopped just after finding that the variable "expediente" is the one that gets the value for "expediente" (not very useful!). I don't know whether it is possible to iterate on the page to change the value of that variable, or how.

Thanks!

Jaap
PavoDive
  • One of the selenium packages for R is your best bet. That site has gone to great lengths to prevent scraping. – hrbrmstr Dec 05 '14 at 01:11
  • @hrbrmstr Thanks for the lead. I installed RSelenium and so far have been able to open the page, write the numbers in the box and clear the form again. I would like to get the value in one cell of the results table, whose unique selector is `body > form:nth-child(1) > table:nth-child(4) > tbody:nth-child(1) > tr:nth-child(2) > td:nth-child(6)`, but haven't found a way to do it. I tried `findElement("css selector",uniqueselector)` and `findElements("css selector",uniqueselector)`, but neither works. Would you give me another bit of wisdom here? Thanks! – PavoDive Dec 05 '14 at 02:59
  • When I run the code described above, I get an empty list. If I pass `list(uniqueselector)` instead of `uniqueselector`, I get a `java.lang.ClassCastException`. When I run `htmlParse(rd$getPageSource()[[1]])` I get a lot of output, but not the contents of the table (which is what I'm interested in). Thanks again – PavoDive Dec 05 '14 at 03:04

1 Answer

I have used phantomjs with the RSelenium package. Details on how to set up phantomjs can be found at http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-saucelabs.html#id2a. phantomjs can be driven directly without the need for a Selenium Server (details here). It should be a lot quicker for the task you outline because it runs headless.

The first part of your question can be achieved as follows:

    library(RSelenium)
    appURL <- "http://web.sivicos.gov.co:8080/consultas/consultas/consreg_encabcum.jsp"
    pJS <- phantom() # start phantomjs so it can be driven directly
    remDr <- remoteDriver(browserName = "phantom")
    remDr$open()
    remDr$navigate(appURL)
    # Get the third list item of the select box (MEDICAMENTOS)
    webElem <- remDr$findElement("css", "select[name='grupo'] option:nth-child(3)")
    webElem$clickElement() # select this element
    # Send text to the input with name="expediente"
    webElem <- remDr$findElement("css", "input[name='expediente']")
    webElem$sendKeysToElement(list("2203")) # keys are sent as text
    # Click the Buscar button
    remDr$findElement("id", "INPUT2")$clickElement()

Now the form has been filled in and the search button clicked. The results are in an iframe with name="datos", and the driver has to switch to that iframe before it can see the elements inside it:

    # switch to datos iframe
    remDr$switchToFrame(remDr$findElement("css", "iframe[name='datos']"))
    remDr$findElement("css", "a")$clickElement() # click the link given in the iframe

    # get the resulting data
    appData <- remDr$getPageSource()[[1]]
    # close phantom js
    pJS$stop()
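
Since there are 900+ numbers to check, the single lookup above could be wrapped in a function and repeated. The following is only a sketch under some assumptions: `lookup_expediente` and `lookup_all` are hypothetical helper names, `remDr` is assumed to be an open driver as created above (so `pJS$stop()` should only be called after the loop), and reloading the form with `navigate` is assumed to be an acceptable substitute for pressing "nueva consulta". I have not run it against the live site.

```r
# Hypothetical helper: run one lookup and return the page source of the result.
# `remDr` is assumed to be an open RSelenium remoteDriver.
lookup_expediente <- function(remDr, expediente, appURL) {
  remDr$switchToFrame(NULL)  # go back to the default content before reloading
  remDr$navigate(appURL)     # reload the form (assumed equivalent to "nueva consulta")
  remDr$findElement("css", "select[name='grupo'] option:nth-child(3)")$clickElement()
  remDr$findElement("css", "input[name='expediente']")$sendKeysToElement(
    list(as.character(expediente))
  )
  remDr$findElement("id", "INPUT2")$clickElement()
  remDr$switchToFrame(remDr$findElement("css", "iframe[name='datos']"))
  remDr$findElement("css", "a")$clickElement()
  remDr$getPageSource()[[1]]
}

# Hypothetical wrapper: loop over all numbers, pausing between requests.
# A failed lookup is stored as NA instead of stopping the whole run.
lookup_all <- function(remDr, expedientes, appURL, pause = 1) {
  pages <- lapply(expedientes, function(e) {
    Sys.sleep(pause)
    tryCatch(lookup_expediente(remDr, e, appURL),
             error = function(err) NA_character_)
  })
  names(pages) <- as.character(expedientes)
  pages
}

# Usage (needs a live session, so it is left commented out):
# pages <- lookup_all(remDr, c(2203, 3519), appURL)
```

Keeping the failures as `NA` makes it easy to see afterwards which numbers need a second attempt, and the pause is there to avoid hammering the agency's server.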

The page source of the iframe is now contained in appData. As an example, we look at the third table using the simple extraction function readHTMLTable from the XML package:

    library(XML)
    readHTMLTable(appData, which = 3)
                          V1     V2      V3              V4       V5                      V6
    1 Presentacion Comercial   <NA>    <NA>            <NA>     <NA>                    <NA>
    2             Expediente Consec Termino Unidad / Medida Cantidad             Descripcion
    3              000002203     01    0176              ml    60,00  FRASCO AMBAR POR 60 ML
    4              000002203     02    0176              ml   120,00 FRASCO AMBAR POR 120 ML
    5              000002203     03    0176              ml    90,00  FRASCO AMBAR POR 90 ML
              V7     V8            V9
    1       <NA>   <NA>          <NA>
    2 Fecha insc Estado Fecha Inactiv
    3 2007/01/30 Activo
    4 2007/01/30 Activo
    5 2012/03/15 Activo
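
To get at the line that starts with FABRICANTE, one option is to read all the tables and keep only the rows whose first cell starts with that label. This is a sketch, not something checked against the real page: `rows_starting_with` is a hypothetical helper, and it assumes the label sits in the first column of one of the tables. It uses base R only and expects the list of data frames that `readHTMLTable(appData)` returns when `which` is omitted.

```r
# Hypothetical helper: from a list of data frames (such as the result of
# readHTMLTable(appData)), keep the rows whose first column starts with `label`.
rows_starting_with <- function(tables, label = "FABRICANTE") {
  hits <- lapply(tables, function(tbl) {
    # readHTMLTable can return NULL for tables it could not convert
    if (is.null(tbl) || nrow(tbl) == 0) return(NULL)
    first_col <- trimws(as.character(tbl[[1]]))
    keep <- !is.na(first_col) & startsWith(toupper(first_col), toupper(label))
    tbl[keep, , drop = FALSE]
  })
  # Drop the tables that had no matching row
  Filter(function(x) !is.null(x) && nrow(x) > 0, hits)
}

# Usage (assumes appData from the code above and the XML package):
# rows_starting_with(readHTMLTable(appData))
```

The match ignores case and surrounding whitespace, because cell text scraped from HTML is often padded.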
jdharrison
  • You made my day, thanks! I didn't use `phantomjs`, but built upon what I already had in RSelenium. The last part (switching to the iframe) and using `readHTMLTable` were key; I was seriously struggling to get to one of the table elements! – PavoDive Dec 05 '14 at 04:35