How to download and/or extract data stored in a 'raw' binary zip object within a response object in R?

Question

I am unable to download or read a zip file from an API request using httr package. Is there another package I can try that will allow me to download/read binary zip files stored within the response of a get request in R?

I tried two ways:

used GET to get an application/json type response object (successful) and then used fromJSON to extract content using content(my_response, 'text'). The output includes a column called 'zip' which is the data I'm interested in downloading, which documentation states is a base64 encoded binary file. This column is currently a really long string of random letters and I'm not sure how to convert this to the actual dataset.
I tried bypassing using fromJSON because I noticed there is a field of class 'raw' within the response object itself. This object is a list of random numbers which I suspect are the binary representation of the dataset. I tried using rawToChar(my_response$content) to try to convert the raw data type into character, but this results in the same long character string being produced as in #1.
I noticed that with approach #1, if I use base64_dec() to try to convert the long character string I also get the same type of output as the 'raw' field within the response object itself.

getzip1  <- GET(getzip1_link)
getzip1 # successful response, status 200
df <- fromJSON(content(getzip1, "text"))

df$status # "OK"
df$dataset$zip # <- this is the very long string of letters (eg. "I1NC5qc29uUEsBAhQDFA...")

# Method 1: try to convert from the 'zip' object in the output of fromJSON
try1 <- base64_dec(df$dataset$zip)
#looks similar to getzip1$content (i.e.  this produces the list of numbers/letters 50 4b 03 04 14 00, etc, perhaps binary representation)

# Method 2: try to get data directly from raw object
class(getzip1$content) # <- 'raw' class object directly from GET request
try2 <- rawToChar(getzip1$content) #returns same output as df$data$zip

I should be able to use either the raw 'content' object from my response or the long character string in the 'zip' object of the output of fromJSON in order to view the dataset or somehow download it. I don't know how to do this. Please help!

Have you tried just a plan simple download of the zip file eg. via getZip from the Hmisc package: `read.csv(Hmisc::getZip('http://biostat.mc.vanderbilt.edu/twiki/pub/Sandbox/WebHome/z.zip'))` — GWD, Oct 21 '19 at 17:58
The issue is I don't only have 1 destination link, I actually need to cycle through a bunch of dataset keys and use the GET request to pull down a bunch of zip files. I could inspect the HTML of the website to scrape the list of all of the zip links that I want and perform the above, but I don't want to have to do that because that's what the API is for. — Kateryna Savchyn, Oct 21 '19 at 18:24

José Bayoán Santiago Calderón · Accepted Answer · 2019-10-24T15:24:34.307

welcome!

Based on the documentation for the API the response to the getDataset endpoint has schema

Dataset archive including meta information, the dataset itself is base64 encoded to allow for binary ZIP transfers.

{
 "status": "OK",
 "dataset": {
 "state_id": 5,
 "session_id": 1624,
 "session_name": "2019-2020 Regular Session",
 "dataset_hash": "1c7d77fe298a4d30ad763733ab2f8c84",
 "dataset_date": "2018-12-23",
 "dataset_size": 317775,
 "mime": "application\/zip",
 "zip": "MIME 64 Encoded Document"
 }
}

We can use R for obtaining the data with the following code,

library(httr)
library(jsonlite)
library(stringr)
library(maditr)
token <- "" # Your API key
session_id <- 1253L # Obtained from the getDatasetList endpoint
access_key <- "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
destfile <- file.path("path", "to", "file.zip") # Modify
response <- str_c("https://api.legiscan.com/?key=",
                  token,
                  "&op=getDataset&id=",
                  session_id,
                  "&access_key=",
                  access_key) %>%
  GET()
status_code(x = response) == 200 # Good
body <- content(x = response,
                as = "text",
                encoding = "utf8") %>%
  fromJSON() # This contains some extra metadata
content(x = response,
        as = "text",
        encoding = "utf8") %>%
  fromJSON() %>%
  getElement(name = "dataset") %>%
  getElement(name = "zip") %>%
  base64_dec() %>%
  writeBin(con = destfile)
unzip(zipfile = destfile)

unzip will unzip the files which in this case will look like

hash.md5 # Can be checked against the metadata
AL/2016-2016_1st_Special_Session/bill/*.json
AL/2016-2016_1st_Special_Session/people/*.json
AL/2016-2016_1st_Special_Session/vote/*.json

As always, wrap your code in functions and profit.

PS: Here is how the code would like like in Julia as a comparison.

using Base64, HTTP, JSON3, CodecZlib
token = "" # Your API key
session_id = 1253 # Obtained from the getDatasetList endpoint
access_key = "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
destfile = joinpath("path", "to", "file.zip") # Modify
response = string("https://api.legiscan.com/?",
                  join(["key=$token",
                        "op=getDataset",
                        "id=$session_id",
                        "access_key=$access_key"],
                        "&")) |>
    HTTP.get
@assert response.status == 200
JSON3.read(response.body) |>
    (content -> content.dataset.zip) |>
    base64decode |>
    (data -> write(destfile, data))
run(pipeline(`unzip`, destfile))

How to download and/or extract data stored in a 'raw' binary zip object within a response object in R?

1 Answers1

Linked