0

I am pulling data directly from a Postgres database into R, where one of the columns in the Postgres table contains rows of JSON objects. I am trying to unpack the JSON objects and have them flatten into columns in an R dataframe, but so far, I'm getting mangled results.

Here's my code:

library(RPostgreSQL)
library(jsonlite)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, host="xxx", dbname="xxx", user="xxx", password="xxx")
query="select column1, column2, json from dummy_table limit 2" 
resultSet <- dbSendQuery(con, query)
rawData<-fetch(resultSet,n=-1)
postgresqlCloseConnection(con)

rawData$json
[1]"{\"id\":{\"publisherName\":\"pub1\",\"visitorId\":\"visitor1\",\"timestamp\":1234},\"startAt\":4567,\"endAt\":8910}"
[2]"{\"id\":{\"publisherName\":\"pub2\",\"visitorId\":\"visitor2\",\"timestamp\":2345},\"startAt\":678,\"endAt\":91011}"

unpacked<-fromJSON(rawData$json, simplifyDataFrame=FALSE)

unpacked
$id
$id$publisherName
[1] "pub1"
$id$visitorId
[1] "visitor1"
$id$timestamp
[1] 1234
$startAt
[1] 4567
$endAt
[1] 8910

As you can see, it only unpacked the first JSON object, and it left things quasi-nested (which is fine, but optimally, i would want all the data to live in one level in a dataframe).

I would want the data to look like this:

unpacked

id.publisherName     id.visitorId     id.timestamp     startAt       endAt
pub1                 visitor1         1234             4567          8910
pub2                 visitor2         2345             678           91011

EDIT: Adding the rawData dataframe:

rawData<-structure(list(
  column1 = c("abcd", "efgh"
), 
column2 = structure(c(123, 456), class = c("POSIXct", 
"POSIXt"), tzone = ""), 
json = c("{\"id\":{\"publisherName\":\"pub1\",\"visitorId\":\"visitor1\",\"timestamp\":1234},\"startAt\":4567,\"endAt\":8910}", 
"{\"id\":{\"publisherName\":\"pub2\",\"visitorId\":\"visitor2\",\"timestamp\":2345},\"startAt\":678,\"endAt\":91011}"
))
, .Names = c("column1", "column2", "json"), 
row.names = 1:2, class = "data.frame")

Here's what happens with the paste function.

rawJSON <- paste("[", paste(rawData$json, collapse=","), "]")
rawJSON <- fromJSON(rawJSON, simplifyDataFrame=FALSE)

rawJSON
[[1]]
[[1]]$id
[[1]]$id$publisherName
[1] "pub1"

[[1]]$id$visitorId
[1] "visitor1"

[[1]]$id$timestamp
[1] 1234


[[1]]$startAt
[1] 4567

[[1]]$endAt
[1] 8910


[[2]]
[[2]]$id
[[2]]$id$publisherName
[1] "pub2"

[[2]]$id$visitorId
[1] "visitor2"

[[2]]$id$timestamp
[1] 2345


[[2]]$startAt
[1] 678

[[2]]$endAt
[1] 91011
Bryan
  • 5,999
  • 9
  • 29
  • 50

1 Answers1

1

The fromJSON function assumes that you are feeding it a single complete json string. Character vectors will be collapsed into single string. In your case your data contains multiple separate json objects. So you either need to convert them all individually:

lapply(rawData$json, fromJSON)

Or, to get the result that you're after, use stream_in to parse them as ndjson.

mydata <- jsonlite::stream_in(textConnection(rawData$json))
fromJSON(myjson)

See the jsonlite ?stream_in manual page for more details.

Jeroen Ooms
  • 31,998
  • 35
  • 134
  • 207
  • Thanks -- I've tried using `lapply` and the method you described above, but it doesn't unpack it into a dataframe -- it looks like it's still a list of 2 items, in which is a hierarchy. When I try to coerce the results of fromJSON into a dataframe, I get an error that says 'arguments imply differing number of rows'. I've tried using `lapply` and `unlist` as well on the results, but get gobeldygook too. – Bryan Mar 13 '14 at 17:14
  • Did you try the second solution? Could you maybe post some example json data somewhere? That makes it easier to help you. – Jeroen Ooms Mar 14 '14 at 06:28
  • I did, but to no avail. I'll add a sample dataframe above to help us out. – Bryan Mar 14 '14 at 19:54
  • Based in your example data, the second solution (with `paste`) seems to work? What error do you get when executing http://pastebin.com/embed_iframe.php?i=TgTZKqjz – Jeroen Ooms Mar 14 '14 at 21:21
  • I don't get an error per se when executing you code, but I get the structure described in my post: a list of two items with their own hirearchies. I figured out the difference in our outcomes -- I have to set `simplifyDataFrame=FALSE` in `fromJSON` as the real data has additional hirearchies that simplifyDataFrame is screwing up. So, given that I can't just rely on simplifyDataFrame, what can I do? – Bryan Mar 14 '14 at 21:33
  • I don't really understand the question. If you want the output of `fromJSON` to be a data frame (as your question implies), then you need to use `simplifyDataFrame`. If this doesn't work, can you provide a (minimal) example that shows what you are after and what is not working? – Jeroen Ooms Mar 14 '14 at 22:03
  • @Jeroen creating a json array `[json1,json2]` is also really key to a substantial speed up here, because you don't have to `lapply`. Really helpful. Just out of curiosity: Is there a way to use stream_in with `RPostgreSQL` ? – Matt Bannert Dec 13 '16 at 13:49
  • Yes today you can use `stream_in`. I will update the response. – Jeroen Ooms Dec 14 '16 at 14:22