0

I have an ndjson data source. For a simple example, consider a text file with three lines, each containing a valid json message. I want to extract 7 variables from the messages and put them in a dataframe.

Please use the following sample data in a text file. You can paste this data into a text editor and save it as "ndjson_sample.txt"

{"ts":"1","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-70,\"Var4\":12353,\"Var5\":1,\"Var6\":\"abc\",\"Var7\":\"x\"}"}
{"ts":"2","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-68,\"Var4\":4528,\"Var5\":1,\"Var6\":\"def\",\"Var7\":\"y\"}"}
{"ts":"3","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-70,\"Var4\":-5409,\"Var5\":1,\"Var6\":\"ghi\",\"Var7\":\"z\"}"}

The following three lines of code accomplish what I want to do:

file1 <- "ndjson_sample.txt"
json_data1 <- ndjson::stream_in(file1)
raw_df_temp1 <- as.data.frame(ndjson::flatten(json_data1$ct))

For reasons I won't get into, I cannot use the ndjson package. I must find a way to use the jsonlite package to do the same thing using the stream_in() and stream_out() functions. Here's what I tried:

con_in1 <- file(file1, open = "rt")
con_out1 <- file(tmp <- tempfile(), open = "wt")
callback_func <- function(df){
  jsonlite::stream_out(df, con_out1, pagesize = 1)
}
jsonlite::stream_in(con_in1, handler = callback_func, pagesize = 1)
close(con_out1)
con_in2 <- file(tmp, open = "rt")
raw_df_temp2 <- jsonlite::stream_in(con_in2)

This is not giving me the same data frame as a final output. Can you tell me what I'm doing wrong and what I have to change to make raw_df_temp1 equal raw_df_temp2?

I could potentially solve this with a the fromJSON() functions operating on each line of the file, but I'd like to find a way to do it with the stream functions. The files I will be dealing with a are quite large and so efficiency will be key. I need this to be as fast as possible.

Thank you in advance.

ds_guy
  • 143
  • 2
  • 5
  • Is there a specific reason you're trying to match input with `stream_out(stream_in(...),...)`? You say it *"is not giving me the same data frame"*, what do you get? (please edit your question for that last one) – r2evans Apr 20 '18 at 16:07
  • I want `raw_df_temp1` to match `raw_df_temp2`. Perhaps what I'm trying to do with opening and closing the stream connection is not correct. I'm asking what is the right way to do this using `jsonlite` and the `stream_in` and `stream_out` functions. Currently, my `raw_dr_temp2` dataframe contains two variables `ts` and `ct`. Under `ct` you'll find the valid JSON message I want to pull the data from. I'm looking for the most efficient way to pull this data and populate a dataframe...similar to the way it is done in `ndjson` package. – ds_guy Apr 20 '18 at 16:16

1 Answers1

1

Currently under ct you'll find a string that can (subsequently) be fed to fromJSON independently, but it will not be parsed as such. Ignoring your stream_out(stream_in(...),...) test, here are a couple of ways to read it in:

library(jsonlite)
json <- stream_in(file('ds_guy.ndjson'), simplifyDataFrame=FALSE)
# opening file input connection.
#  Imported 3 records. Simplifying...
# closing file input connection.
cbind(
  ts = sapply(json, `[[`, "ts"),
  do.call(rbind.data.frame, lapply(json, function(a) fromJSON(a$ct)))
)
#   ts Var1 Var2 Var3  Var4 Var5 Var6 Var7
# 1  1    6    6  -70 12353    1  abc    x
# 2  2    6    6  -68  4528    1  def    y
# 3  3    6    6  -70 -5409    1  ghi    z

Calling fromJSON on each string might be cumbersome, and with larger data this slow-down is why there is stream_in, so if we can capture the "ct" component into a stream of its own, then ...

writeLines(sapply(json, `[[`, "ct"), 'ds_guy2.ndjson')

(There are far-more-efficient ways to do this with non-R tools, including perhaps a simple

sed -e 's/.*"ct":"\({.*\}\)"}$/\1/g' -e 's/\\"/"/g' ds_guy.ndjson > ds_guy.ndjson2

though this makes a few assumptions about the data that may not be perfectly safe. A better solution would be to use jq, which should "always" correctly-parse proper json, then a quick sed to replace escaped quotes:

jq '.ct' ds_guy.ndjson | sed -e 's/\\"/"/g' > ds_guy2.ndjson

and you can do that with system(...) in R if needed.)

From there, under the assumption that each line will contain exactly one row of data.frame data:

json2 <- stream_in(file('ds_guy2.ndjson'), simplifyDataFrame=TRUE)
# opening file input connection.
#  Imported 3 records. Simplifying...
# closing file input connection.
cbind(ts=sapply(json, `[[`, "ts"), json2)
#   ts Var1 Var2 Var3  Var4 Var5 Var6 Var7
# 1  1    6    6  -70 12353    1  abc    x
# 2  2    6    6  -68  4528    1  def    y
# 3  3    6    6  -70 -5409    1  ghi    z

NB: in the first example, "ts" is a factor, all others are character because that's what fromJSON gives. In the second example, all strings are factor. This can easily be addressed through judicious use of stringsAsFactors=FALSE, depending on your needs.

r2evans
  • 141,215
  • 6
  • 77
  • 149