
I have a sqlite database file with several columns. One of the columns has a JSON dictionary (with two keys) embedded in it. I want to extract the JSON column to a data frame in R that shows each key in a separate column.

I tried rjson::fromJSON, but it reads only the first item. Is there a trick that I'm missing?
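For reference, I'm pulling the JSON column into R roughly like this (a self-contained sketch using an in-memory SQLite database so it runs as-is; the table and column names are made up, and with a real file you'd pass its path instead of ":memory:"):

```r
library(DBI)
# Demo database: in-memory SQLite with a hypothetical table "mytable"
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mytable",
             data.frame(id = 1:2,
                        json_col = c("{\"3x\": 20, \"6y\": 23}",
                                     "{\"3x\": 60, \"6y\": 50}")))
# Pull the JSON column as a character vector
eg <- dbGetQuery(con, "SELECT json_col FROM mytable")$json_col
dbDisconnect(con)
eg
# [1] "{\"3x\": 20, \"6y\": 23}" "{\"3x\": 60, \"6y\": 50}"
```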

Here's an example that mimics my problem:

> eg <- as.vector(c("{\"3x\": 20, \"6y\": 23}", "{\"3x\": 60, \"6y\": 50}"))
> fromJSON(eg)

$`3x`
[1] 20

$`6y`
[1] 23

The desired output is something like:

# a data frame for both variables
  3x 6y
1 20 23
2 60 50

or,

# a data frame for each variable
  3x 
1 20
2 60

  6y
1 23
2 50
N.Hamoud
  • Does not do the same for me; I get `parse error: trailing garbage`, as if it is concatenating the strings. Regardless, what are you expecting? Are you looking for `lapply(eg, fromJSON)`? (I'm on `jsonlite-1.5`, if that matters.) – r2evans Feb 14 '18 at 22:43
  • Thanks @r2evans, jsonlite gives me the same error you had, rjson does not. I tried lapply, but it produces as many lists as the number of items. My goal is to have a list for each key, one for "1x" and one for "5y". – N.Hamoud Feb 15 '18 at 00:26
  • That's completely different ... you need to edit your question and insert your desired output. The answer will likely start with my `lapply` suggest, and then something using `Reduce` might be useful. (And sorry, I missed the `rjson::` in your question, my bad.) – r2evans Feb 15 '18 at 00:38
  • Thanks @r2evans, the desired output is added. I'll check reduce(), and I think paste() may also do it. I'll give it a try. – N.Hamoud Feb 15 '18 at 01:52

1 Answer


What you are looking for is actually a combination of lapply and some form of row-binding such as rbind.

I'll extend your data a little, just to have more than 2 elements.

eg <- c("{\"3x\": 20, \"6y\": 23}",
        "{\"3x\": 60, \"6y\": 50}",
        "{\"3x\": 99, \"6y\": 72}")

library(jsonlite)

Using base R, we can do

do.call(rbind.data.frame, lapply(eg, fromJSON))
#   X3x X6y
# 1  20  23
# 2  60  50
# 3  99  72

You might be tempted to do something like Reduce(rbind, lapply(eg, fromJSON)), but the notable difference is that in the Reduce model, rbind is called "N-1" times, where "N" is the number of elements in eg; this results in a LOT of copying of data, and though it might work alright with small "N", it scales horribly. With the do.call option, rbind is called exactly once.

Notice that the column labels have been R-ized, since data.frame column names should not start with numbers. (It is possible, but generally discouraged.)
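If you'd rather keep the original names, one option (an alternative I'm suggesting, not part of the solution above) is to convert each parsed list to a data.frame with check.names = FALSE before binding:

```r
library(jsonlite)
eg <- c("{\"3x\": 20, \"6y\": 23}",
        "{\"3x\": 60, \"6y\": 50}",
        "{\"3x\": 99, \"6y\": 72}")
# check.names = FALSE preserves the non-syntactic names "3x" and "6y"
out <- do.call(rbind,
               lapply(eg, function(s) as.data.frame(fromJSON(s), check.names = FALSE)))
out
#   3x 6y
# 1 20 23
# 2 60 50
# 3 99 72
```

You'll then need backticks or [[ to access the columns, as shown later.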

If you're confident that all substrings will have exactly the same elements, then you may be good here. If there's a chance that there will be a difference at some point, perhaps

eg <- c(eg, "{\"3x\": 99}")

then you'll notice that the base R solution no longer works by default.

do.call(rbind.data.frame, lapply(eg, fromJSON))
# Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors())  : 
#   numbers of columns of arguments do not match

There may be techniques to try to normalize the elements such that you can be assured of matches. However, if you're not averse to a tidyverse package:

library(dplyr)
eg2 <- bind_rows(lapply(eg, fromJSON))
eg2
# # A tibble: 4 × 2
#    `3x`  `6y`
#   <int> <int>
# 1    20    23
# 2    60    50
# 3    99    72
# 4    99    NA

Though you cannot access these columns directly with the dollar-method (the names start with digits), you can still use [[ or backticks.

eg2$3x
# Error: unexpected numeric constant in "eg2$3"
eg2[["3x"]]
# [1] 20 60 99 99
eg2$`3x`
# [1] 20 60 99 99
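If you'd rather stay in base R for the uneven case, one normalization technique (a sketch, not the only way) is to collect every key seen anywhere and pad each parsed list with NA before binding:

```r
library(jsonlite)
eg <- c("{\"3x\": 20, \"6y\": 23}",
        "{\"3x\": 60, \"6y\": 50}",
        "{\"3x\": 99, \"6y\": 72}",
        "{\"3x\": 99}")
parsed <- lapply(eg, fromJSON)
keys <- unique(unlist(lapply(parsed, names)))  # all keys seen in any element
filled <- lapply(parsed, function(x) {
  x[setdiff(keys, names(x))] <- NA             # pad missing keys with NA
  x[keys]                                      # enforce a consistent column order
})
do.call(rbind.data.frame, filled)
#   X3x X6y
# 1  20  23
# 2  60  50
# 3  99  72
# 4  99  NA
```

The column names are again R-ized here, since this goes back through data.frame().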
r2evans