I am new to web scraping in R and have recently run into a problem with sites that rely on JavaScript. I am trying to scrape the data from the web page below and have been unsuccessful. I believe that the JavaScript links prevent me from accessing the table. As a result, the function readHTMLTable from the R package XML returns NULL.

library(XML)
library(RCurl)
url <- "http://votingrights.news21.com/interactive/movement-voter-id/index.html"
page <- getURL(url)
tabs <- htmlParse(page, asText = TRUE)
tabs <- readHTMLTable(tabs, stringsAsFactors = FALSE)

How can I access the JavaScript links to get to the data? Is this even possible? When I use the direct link to the data (below) with the R package rjson, I am still unable to read in the data.

library("rjson")
json_file <- "http://votingrights.news21.com/static/interactives/movement/data/fulldata.js"
lines <- readLines(json_file)
json_data <- fromJSON(paste(lines, collapse=""))
Artjom B.
C.Bright

1 Answer

The file you reference is a JavaScript file that assigns JSON to a variable, rather than a plain JSON file. In this case you can manually scrub the contents to get at the data:

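library("rjson")
json_file <- "http://votingrights.news21.com/static/interactives/movement/data/fulldata.js"
lines <- readLines(json_file)
# Drop the JavaScript assignment ("... = ") in front of the JSON on the first line
lines[1] <- sub(".* = (.*)", "\\1", lines[1])
# Drop the semicolon that ends the JavaScript statement on the last line
lines[length(lines)] <- sub(";", "", lines[length(lines)])
json_data <- fromJSON(paste(lines, collapse="\n"))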
> head(json_data[[1]][[1]])
$state
[1] "Alabama"

$bill
[1] "HB 19"

$category
[1] "Strict photo ID"

$introduced
[1] "Mar 1, 2011"

$house
[1] "Yes"

$senate
[1] "Yes"
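
The scrubbing step can also be wrapped in a small helper and tried on an inline string, which makes it easy to check without network access. This is only a sketch: strip_js_wrapper is a hypothetical name, the "var" keyword in the sample text is an assumption about how the file begins, and the sample holds only a few fields of the record shown above.

```r
# Hypothetical helper: strip a leading "name = " assignment and a trailing
# semicolon from JavaScript source so that only the JSON text remains.
strip_js_wrapper <- function(lines) {
  txt <- paste(lines, collapse = "\n")
  txt <- sub("^[^=]*=\\s*", "", txt)  # drop everything up to the first "="
  txt <- sub(";\\s*$", "", txt)       # drop the final semicolon, if present
  txt
}

# Inline stand-in for readLines(json_file). The leading "var" is an assumption.
js_lines <- c(
  'var fullData = [[{"state": "Alabama", "bill": "HB 19",',
  '  "category": "Strict photo ID"}]];'
)

json_txt <- strip_js_wrapper(js_lines)

# Parse only if rjson is installed, so the sketch still runs without it.
if (requireNamespace("rjson", quietly = TRUE)) {
  json_data <- rjson::fromJSON(json_txt)
  print(json_data[[1]][[1]]$state)
}
```

With the real file, passing readLines(json_file) to the helper should give the same text that the two sub() calls above produce, provided the file consists of a single assignment statement.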

If you want to work with the JavaScript data as it is loaded in the web page, you can drive a browser with Selenium through the RSelenium package:

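library(RSelenium)
appURL <- "http://votingrights.news21.com/static/interactives/movement/index.html"
pJS <- phantom()                                     # start a headless PhantomJS process
remDr <- remoteDriver(browserName = "phantom")
remDr$open()                                         # connect to the browser session
remDr$navigate(appURL)                               # load the page so its scripts run
fullData <- remDr$executeScript("return fullData;")  # copy the page's JavaScript variable into R
remDr$close()                                        # close the browser session
pJS$stop()                                           # shut down PhantomJS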
> head(fullData[[1]][[1]])
$state
[1] "Alabama"

$bill
[1] "HB 19"

$category
[1] "Strict photo ID"

$introduced
[1] "Mar 1, 2011"

$house
[1] "Yes"

$senate
[1] "Yes"
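
Since the original goal was a table, a list of records like this can be flattened into a data frame. The following is a rough sketch under assumptions: it supposes each record is a named list of length-one strings, the first record copies the fields printed above, and the second record is an invented placeholder that only illustrates the row binding. The real records may carry more fields than the six that head() shows.

```r
# Assumed shape: a list of records, each a named list of length-one strings.
# The second record is an invented placeholder, not real data.
records <- list(
  list(state = "Alabama", bill = "HB 19", category = "Strict photo ID",
       introduced = "Mar 1, 2011", house = "Yes", senate = "Yes"),
  list(state = "Example State", bill = "XX 0", category = "Placeholder",
       introduced = "Jan 1, 2000", house = "No", senate = "No")
)

# Turn each record into a one-row data frame, then stack the rows.
rows <- lapply(records, function(rec) {
  as.data.frame(rec, stringsAsFactors = FALSE)
})
bills <- do.call(rbind, rows)

str(bills)
```

This approach breaks if a record has a NULL or multi-valued field, so the real data may need cleaning before the rbind step.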
jdharrison
  • Thank you! I had tried this before, but I missed the step where you sub the ";" out, so I wasn't able to get it to work. This solution works well. I am wondering, however, whether there is a package that will read in this type of script without having to manually scrub the contents each time... – C.Bright Dec 06 '14 at 02:09
  • You can use Selenium and access the javascript data directly. See various vignettes at http://cran.r-project.org/web/packages/RSelenium/index.html – jdharrison Dec 06 '14 at 02:29
  • I had a similar problem a couple of days ago. I went with RSelenium and it solved the problem well. You might find it interesting to see http://stackoverflow.com/questions/27305824/extracting-data-from-javascript-with-r/27308368#27308368 – PavoDive Dec 06 '14 at 14:00
  • Thank you. I will check out RSelenium. – C.Bright Dec 07 '14 at 01:05