Rather than resorting to rvest and scraping, you can use their API directly. As I said, their SQL example errors out, but it works without the WHERE… part (example below). Here are the building blocks for a repeatable process using either their plain search or SQL search:
library(jsonlite)
library(httr)
# for passing in a SQL statement
query_nj_sql <- function(sql=NULL) {
  if (is.null(sql)) return(NULL)
  res <- GET("http://data.ci.newark.nj.us/api/action/datastore_search_sql",
             query=list(sql=sql))
  stop_for_status(res) # catches errors
  fromJSON(content(res, as="text"))
}
# for their plain search syntax
query_nj_search <- function(resource_id=NULL, query=NULL, offset=NULL) {
  if (is.null(resource_id)) return(NULL)
  res <- GET("http://data.ci.newark.nj.us/api/action/datastore_search",
             query=list(resource_id=resource_id,
                        offset=offset, # forward the caller's offset (NULL entries are dropped)
                        q=query))
  stop_for_status(res) # catches errors
  fromJSON(content(res, as="text"))
}
# this SQL does not error out
sql_dat <- query_nj_sql('SELECT * from "d7b23f97-cba5-4c15-997c-37a696395d66"')
search_dat <- query_nj_search(resource_id="d7b23f97-cba5-4c15-997c-37a696395d66")
As I said, that SQL query won't error out.
Both calls return a slightly complex list structure that you can examine with:
str(sql_dat)
str(search_dat)
But the records are in there:
dplyr::glimpse(sql_dat$result$records)
## Observations: 545
## Variables: 40
## $ Total population 25 years and over (chr) "6389.0", "68.0", "4197.0", "389.0", "1211.0", "4...
## $ Male - Associate's degree (chr) "286.0", "0.0", "63.0", "6.0", "69.0", "31.0", "7...
## $ Male - Master's degree (chr) "148.0", "29.0", "379.0", "17.0", "79.0", "24.0",...
## $ Male - 7th and 8th grade (chr) "49.0", "0.0", "16.0", "2.0", "14.0", "0.0", "0.0...
## $ Female - High school graduate, GED, or alternative (chr) "915.0", "0.0", "426.0", "46.0", "174.0", "30.0",...
## $ Male - 11th grade (chr) "88.0", "0.0", "12.0", "0.0", "3.0", "0.0", "0.0"...
## $ Male - Bachelor's degree (chr) "561.0", "0.0", "878.0", "93.0", "137.0", "58.0",...
## $ Male - Some college, 1 or more years, no degree (chr) "403.0", "0.0", "179.0", "23.0", "39.0", "0.0", "...
… (this goes on a while)
The API looks like it may paginate, so you may have to deal with that (hence the offset parameter).
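If it does paginate, one way to collect everything is to keep requesting pages at increasing offsets until one comes back empty. This is only a rough sketch: `fetch_all_pages()`, `fetch_nj_page()` and `max_pages` are names I made up for illustration, and it assumes the endpoint puts each page of rows in `result$records` and treats `offset` as "skip this many rows", which I have not verified for this site.

```r
# Sketch: collect every page from an offset-based endpoint.
# `fetch_page` is a hypothetical callback: given an offset, it returns a
# data frame of records (zero rows, or NULL, once the data is exhausted).
fetch_all_pages <- function(fetch_page, max_pages = 1000) {
  pages <- list()
  offset <- 0
  for (i in seq_len(max_pages)) {   # hard cap so a misbehaving API cannot loop forever
    recs <- fetch_page(offset)
    if (is.null(recs) || NROW(recs) == 0) break   # empty page: nothing left
    pages[[length(pages) + 1]] <- recs
    offset <- offset + NROW(recs)   # advance by however many rows we actually got
  }
  do.call(rbind, pages)             # NULL if no page had any rows
}

# Hypothetical page fetcher for the search endpoint used above.
# ASSUMPTION: records live in result$records and `offset` skips rows.
fetch_nj_page <- function(offset,
                          resource_id = "d7b23f97-cba5-4c15-997c-37a696395d66") {
  res <- httr::GET("http://data.ci.newark.nj.us/api/action/datastore_search",
                   query = list(resource_id = resource_id, offset = offset))
  httr::stop_for_status(res)
  txt <- httr::content(res, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(txt)$result$records
}

# all_recs <- fetch_all_pages(fetch_nj_page)  # needs network access
```

`rbind` expects every page to have the same columns; if the pages differ, something like `dplyr::bind_rows` is more forgiving.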
Since the NJ Edu API supports OData queries, you may be able to use the RSocrata package as well.