
I am trying to download data and do some analysis from the following Open Data webpage:

http://data.ci.newark.nj.us/dataset/new-jersey-education-indicators/resource/d7b23f97-cba5-4c15-997c-37a696395d66

They have given some examples, like this query (via a SQL statement):

http://data.ci.newark.nj.us/api/action/datastore_search_sql?sql=SELECT * from "d7b23f97-cba5-4c15-997c-37a696395d66" WHERE title LIKE 'jones' 

I am using the sqldf package to read the data but am unable to succeed.

  • Do you absolutely want to use R? I would recommend Python for this dynamic scraping ... – Colonel Beauvel Sep 04 '15 at 13:53
  • @ColonelBeauvel so you haven't seen `httr`, `rvest`, `xml2` in action, eh? SO isn't the place to start religious wars. – hrbrmstr Sep 04 '15 at 14:40
  • Unfortunately the example SQL call (on their site, it's what you pasted, so it's not your fault) generates an error in their API. Do you _need_ to use SQL? The other parameter-based ones work (I think). – hrbrmstr Sep 04 '15 at 14:44

2 Answers


Rather than resort to rvest and scraping, you can use their API directly. As I said in the comments, their SQL example errors out, but it works fine without the WHERE… part (example below). Here are the building blocks for a repeatable process using either the plain search or the SQL search:

library(jsonlite)
library(httr)

# for passing in a SQL statement
query_nj_sql <- function(sql=NULL) {
  if (is.null(sql)) return(NULL)
  res <- GET("http://data.ci.newark.nj.us/api/action/datastore_search_sql",
             query=list(sql=sql))
  stop_for_status(res) # catches errors
  fromJSON(content(res, as="text"))
}

# for their plain search syntax
query_nj_search <- function(resource_id=NULL, query=NULL, offset=NULL) {
  if (is.null(resource_id)) return(NULL)
  res <- GET("http://data.ci.newark.nj.us/api/action/datastore_search",
             query=list(resource_id=resource_id,
                        offset=offset,
                        q=query))
  stop_for_status(res) # catches errors
  fromJSON(content(res, as="text"))  
}

# this SQL does not error out
sql_dat <- query_nj_sql('SELECT * from "d7b23f97-cba5-4c15-997c-37a696395d66"')

search_dat <- query_nj_search(resource_id="d7b23f97-cba5-4c15-997c-37a696395d66")

As I said, that SQL query won't error out.
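
If you only need a quick look at the data, standard SQL clauses tacked onto that working statement should pass through as well (a small sketch; the LIMIT clause is an assumption about what their datastore accepts, so drop it if the API rejects it):

# pull only the first 10 records; LIMIT is assumed to be accepted by
# their SQL endpoint
sample_dat <- query_nj_sql(
  'SELECT * from "d7b23f97-cba5-4c15-997c-37a696395d66" LIMIT 10'
)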

Both the SQL call and the plain-search call return a slightly complex list structure that you can examine with:

str(sql_dat)
str(search_dat)

But the records are in there:

dplyr::glimpse(sql_dat$result$records)

## Observations: 545
## Variables: 40
## $ Total population 25 years and over                 (chr) "6389.0", "68.0", "4197.0", "389.0", "1211.0", "4...
## $ Male - Associate's degree                          (chr) "286.0", "0.0", "63.0", "6.0", "69.0", "31.0", "7...
## $ Male - Master's degree                             (chr) "148.0", "29.0", "379.0", "17.0", "79.0", "24.0",...
## $ Male - 7th and 8th grade                           (chr) "49.0", "0.0", "16.0", "2.0", "14.0", "0.0", "0.0...
## $ Female - High school graduate, GED, or alternative (chr) "915.0", "0.0", "426.0", "46.0", "174.0", "30.0",...
## $ Male - 11th grade                                  (chr) "88.0", "0.0", "12.0", "0.0", "3.0", "0.0", "0.0"...
## $ Male - Bachelor's degree                           (chr) "561.0", "0.0", "878.0", "93.0", "137.0", "58.0",...
## $ Male - Some college, 1 or more years, no degree    (chr) "403.0", "0.0", "179.0", "23.0", "39.0", "0.0", "...
… (this goes on a while)

The API looks like it may paginate, so you may have to deal with that (hence the offset parameter).
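
One way to deal with that is a simple loop over offset (a rough sketch built on query_nj_search() above; it assumes the datastore_search endpoint caps each response and reports the full row count in result$total, so verify both against their instance):

# page through the resource in chunks, using the offset parameter of
# query_nj_search(); stops once result$total rows have been collected
query_nj_all <- function(resource_id) {
  first <- query_nj_search(resource_id=resource_id)
  total <- first$result$total
  pages <- list(first$result$records)
  got   <- nrow(first$result$records)
  while (got < total) {
    nxt  <- query_nj_search(resource_id=resource_id, offset=got)
    more <- nxt$result$records
    if (is.null(more) || nrow(more) == 0) break  # nothing further returned
    pages <- c(pages, list(more))
    got   <- got + nrow(more)
  }
  do.call(rbind, pages)  # combine the pages into one data frame
}

all_recs <- query_nj_all("d7b23f97-cba5-4c15-997c-37a696395d66")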

Since the NJ Edu API supports OData queries, you may be able to use the RSocrata package as well.

hrbrmstr

It doesn't look like their SQL example is working. But I don't think you even need sqldf for this; you can just pull in the data using the RCurl package.

If you want to try a different example, you could use the API calls they provide:

library(RCurl)
web <- "http://data.ci.newark.nj.us/api/action/datastore_search?resource_id=d7b23f97-cba5-4c15-997c-37a696395d66&q=jones"
page <- getURL(web)

And then parse the response to make the content more usable.
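
For example (a brief sketch; it assumes the endpoint returns JSON, which is what CKAN-style datastore APIs normally emit, and uses jsonlite to parse it rather than an HTML parser):

library(jsonlite)

# the response body is a JSON document; in CKAN-style responses the
# records sit under result$records (verify against the actual output)
parsed  <- fromJSON(page)
records <- parsed$result$records
head(records)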

darwin