I am using RSelenium to scrape data from a [website][1] with a dynamic form whose dropdown menus change depending on what is selected. I am trying to pull the variable 'Number & Area of Operational Holdings' for every district in every state.

My code works, but I run into a problem when a district does not have a table (the website's database has a few districts with no data). When my code reaches a district with no data, it stops, and I am left with an incomplete dataset.

How can I write code that skips the districts that lack a table? My code is pasted below. A special shout-out goes to the previous Stack Exchange thread on this, [link here][2], as I adapted its code. Also, if anyone can clean up my final output to avoid repeating the variable headers with every new district, it would be appreciated.

rm(list=ls(all=TRUE))

library(RSelenium)
library(XML)
library(dplyr)
library(magrittr)
library(devtools)
library(rvest)

# Start Selenium Server --------------------------------------------------------

checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()


# Simulate browser session and fill out form -----------------------------------

remDrv$navigate('http://agcensus.dacnet.nic.in/districtsummarytype.aspx')

# Select year
remDrv$findElement(using = "xpath", 
                   "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList2']/option[@value = '2010']")$clickElement()

# Select 1 == Number & Area of Operational Holdings
remDrv$findElement(using = "xpath",
                   "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList3']/option[@value = '1']")$clickElement()

# Select 4 == All Social Group 
remDrv$findElement(using = "xpath",
                   "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList4']/option[@value = '4']")$clickElement()

# Select 3 == All Gender (Total) 
remDrv$findElement(using = "xpath",
                   "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList8']/option[@value = '3']")$clickElement()

# Get all state IDs and the respective names
state_IDs <- remDrv$findElements(using = "xpath",
                                 "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option") %>%
  lapply(function(x){x$getElementAttribute('value')}) %>% 
  unlist

state_names <- remDrv$findElements(using = "xpath",
                                   "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option") %>%
  lapply(function(x){x$getElementText()}) %>% 
  unlist


# Retrieve and download results ------------------------------------------------

result <- data.frame(state = character(), district = character(), 
                     V1 = character(), V2 = character(), V3 = character(),
                     V4 = character(), V5 = character(), V6 = character(),
                     V7 = character(), V8 = character(), V9 = character(),
                     V10 = character(), V11 = character(), V12 = character())

for (i in seq_along(state_IDs)) {

  remDrv$findElement(using = "xpath",
                     paste0("//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option[@value = ", 
                            "'", state_IDs[i], "']"))$clickElement()
  Sys.sleep(2)

  # Get all district IDs and names from the currently selected states
  district_IDs <- remDrv$findElements(using = "xpath",
                                      "//div[@id = '_ctl0_ContentPlaceHolder1_Panel14']/select/option") %>%
    lapply(function(x){x$getElementAttribute('value')}) %>%
    unlist

  district_names <- remDrv$findElements(using = "xpath",
                                        "//div[@id = '_ctl0_ContentPlaceHolder1_Panel14']/select/option") %>%
    lapply(function(x){x$getElementText()}) %>%
    unlist


  for (j in seq_along(district_IDs)) {

    remDrv$findElement(using = "xpath",
                       paste0("//div[@id = '_ctl0_ContentPlaceHolder1_Panel14']/select/option[@value = ",
                              "'", district_IDs[j], "']"))$clickElement()
    Sys.sleep(2)

    # Click submit and download data of the selected district
    remDrv$findElement(using = "xpath",
                       "//input[@value = 'Submit']")$clickElement()
    Sys.sleep(2)

    ######### if ##########
    if (remDrv$findElement("xpath", "//input[@value ='No Records found'")) { # this is not really an input value; what is needed is a lookup for the text "No Records found"
      remDrv$goBack()
      Sys.sleep(2)
    } 
    else {

    # Download data for current district
    district_data <- remDrv$getPageSource()[[1]] %>% 
      htmlParse %>% 
      readHTMLTable %>% 
      extract2(4) %>% 
      extract(c(-1, -2), )

    result <- data.frame(state = state_names[i], district = district_names[j],
                         district_data) %>% rbind(result, .)

    remDrv$goBack()
    Sys.sleep(2)
    }
  }
}

remDrv$quit()
remDrv$closeServer()

result %<>% as_data_frame %>%
  rename(
    si_no = V1,
    holding_size = V2, 
    Individual_Number = V3,
    Individual_Area = V4,
    Joint_Number = V5,
    Joint_Area = V6,
    Subtotal_Number = V7,
    Subtotal_Area = V8,
    Institutional_Number = V9,
    Institutional_Area = V10,
    Total_Number = V11,
    Total_Area = V12
  ) %>% 
  mutate(
    si_no = as.numeric(as.character(si_no))
  )

str(result)
levels(result$state)
levels(result$district)
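
On the second request (avoiding repeated variable headers for every district), one possible approach is to drop every row whose first column is not a number. This is only a sketch under a stated assumption: that the repeated header rows carry non-numeric text in the `si_no` column. The data frame `result_demo` and its values are made up for illustration and stand in for `result`. Any genuine row with a non-numeric first column (a totals row, for example) would also be dropped, so the filter may need adjusting.

```r
# Made-up stand-in for `result`: two districts, each with a repeated header row
result_demo <- data.frame(
  state        = c("A", "A", "A", "A"),
  district     = c("X", "X", "Y", "Y"),
  si_no        = c("Sl. No.", "1", "Sl. No.", "1"),
  holding_size = c("Size of holding", "Below 0.5", "Size of holding", "Below 0.5"),
  stringsAsFactors = FALSE
)

# A row counts as data when si_no converts cleanly to a number;
# header text becomes NA (the coercion warning is suppressed on purpose)
is_data_row <- !is.na(suppressWarnings(as.numeric(as.character(result_demo$si_no))))

# Keep only the data rows
result_clean <- result_demo[is_data_row, ]
nrow(result_clean)
```

The same logical vector could be applied to the real `result` before the `rename()` step, using `V1` in place of `si_no`.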
Weevils
  • you can try using an XPath boolean test as I've shown here before: http://stackoverflow.com/questions/26702569/rvest-error-error-in-classout-xmlnodeset-attempt-to-set-an-attribute/26702887#26702887 – hrbrmstr Oct 04 '15 at 00:45
  • Thank you for your response. Sorry for the basic question, but I am new to web scraping. Where would I put the boolean function, and how can I use it in my code? Thanks for your help! – Weevils Oct 04 '15 at 02:10
  • I tried to incorporate the 'if' statement from the solution @hrbrmstr suggested (which looks really promising), but cannot get it to work properly. I get this error: "Error: Summary: InvalidSelector Detail: Argument was an invalid selector (e.g. XPath/CSS). class: org.openqa.selenium.InvalidSelectorException". Any help on how to make the 'if' statement work? Please note that I've updated the code above and included the 'if' statement just after the comment '######### if ##########'. – Weevils Oct 07 '15 at 16:56
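
Two things in the 'if' line are worth noting. The XPath string is missing its closing `]`, which is consistent with the InvalidSelector error, and `findElement` returns a web element rather than the single TRUE/FALSE value that `if` needs. A minimal sketch that sidesteps both is to test the page source text directly. It assumes that the literal text "No Records found" appears in the page source for empty districts (as the comment in the code suggests). The helper name `has_no_records` and the HTML snippets are hypothetical.

```r
# Hypothetical helper: TRUE when the page source contains the
# "No Records found" message (assumed to be literal text on the page)
has_no_records <- function(page_source) {
  grepl("No Records found", page_source, fixed = TRUE)
}

# Inside the district loop it could be used roughly like this
# (remDrv is the remoteDriver from the question):
#
#   page_source <- remDrv$getPageSource()[[1]]
#   if (has_no_records(page_source)) {
#     remDrv$goBack()
#     Sys.sleep(2)
#     next   # skip to the next district
#   }

# Small self-contained check with made-up HTML snippets
empty_page <- "<html><body><span>No Records found</span></body></html>"
data_page  <- "<html><body><table><tr><td>1</td></tr></table></body></html>"

has_no_records(empty_page)
has_no_records(data_page)
```

Using `next` inside the `if` branch means the `else` block is no longer needed, so the table extraction can follow the check at the same indentation level.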

0 Answers