
I have a data frame where the values of the column Parameters are JSON data:

#  Parameters
#1 {"a":0,"b":[10.2,11.5,22.1]}
#2 {"a":3,"b":[4.0,6.2,-3.3]}
...

I want to extract the parameters of each row and append them to the data frame as columns A, B1, B2 and B3.

How can I do it?

I would rather use dplyr if it is possible and efficient.

Medical physicist
  • @akrun Yes, but I don't know how to apply fromJSON to each row and append the data to the data frame – Medical physicist Aug 21 '15 at 08:02
  • If you want to extract the numeric part, `library(stringr);do.call(rbind,lapply(str_extract_all(df1$Parameters, '-?[0-9.]+'), as.numeric))` and name the columns as `A, B1:B3` – akrun Aug 21 '15 at 08:16
  • library(rjson); v = c('{"a":0,"b":[10.2,11.5,22.1]}','{"a":3,"b":[4.0,6.2,-3.3]}'); lapply(v,fromJSON) – Stan Yip Aug 21 '15 at 08:18
  • @akrun Isn't it possible to use fromJSON? It would also make it possible to extract string variables. – Medical physicist Aug 21 '15 at 08:21
  • Looks like @galapagos showed one way to do that – akrun Aug 21 '15 at 08:23
  • @akrun I'm trying it, but I have a stupid problem with the format: Error in FUN(c("{\"a\":0,\"b\":[10.2,11.5,22.1]}", : no data to parse – Medical physicist Aug 21 '15 at 08:53
  • @Medicalphysicist library(rjson); v = c('{"a":0,"b":[10.2,11.5,22.1]}','{"a":3,"b":[4.0,6.2,-3.3]}'); v1 = lapply(v,fromJSON); data.frame(t(sapply(v1,function(y) lapply(y,function(x) paste(x,collapse=','))))) – Stan Yip Aug 21 '15 at 09:46

2 Answers


In your example data, each row contains a JSON object. This format is called JSON Lines (also known as ndjson), and the jsonlite package has a dedicated function, stream_in, to parse such data into a data frame:

# Example data
mydata <- data.frame(parameters = c(
  '{"a":0,"b":[10.2,11.5,22.1]}',
  '{"a":3,"b":[4.0,6.2,-3.3]}'
), stringsAsFactors = FALSE)

# Parse json lines
res <- jsonlite::stream_in(textConnection(mydata$parameters))

# Extract columns
a <- res$a
b1 <- sapply(res$b, "[", 1)
b2 <- sapply(res$b, "[", 2)
b3 <- sapply(res$b, "[", 3)

In your example, the JSON structure is fairly simple, so the other suggestions work as well, but this solution generalizes to more complex JSON structures.
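To get all the way to the layout asked for in the question (columns A, B1, B2 and B3 appended to the original data frame), the extracted pieces can be bound back on with dplyr. This is only a minimal sketch: the names `b` and `result` are made up for illustration, and it assumes every row's "b" array has the same length.

```r
library(jsonlite)
library(dplyr)

# Example data (same as above)
mydata <- data.frame(parameters = c(
  '{"a":0,"b":[10.2,11.5,22.1]}',
  '{"a":3,"b":[4.0,6.2,-3.3]}'
), stringsAsFactors = FALSE)

# Parse json lines; verbose = FALSE only silences the progress messages
res <- stream_in(textConnection(mydata$parameters), verbose = FALSE)

# Stack the per-row "b" vectors into a matrix, one row per record.
# Assumption: all "b" arrays have the same number of elements.
b <- do.call(rbind, res$b)
colnames(b) <- paste0("B", seq_len(ncol(b)))

# Append A and B1..B3 to the original data frame
result <- mydata %>%
  mutate(A = res$a) %>%
  bind_cols(as.data.frame(b))
```

If the "b" arrays can differ in length, the rbind step would need padding first, so treat this as a starting point rather than a general solution.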

Jeroen Ooms
  • This is only slightly (4 vs. 5 seconds on 60K records) faster than constructing a huge JSON string and parsing it all together (a la `fromJSON(sprintf('{%s}', paste(sprintf('"%s": %s', my_key, my_json_column), collapse = ',')))`), is that roughly expected? – MichaelChirico Apr 15 '19 at 11:10
  • Try generating a json string with 60M records. – Jeroen Ooms Apr 23 '19 at 08:39

I had a similar problem: several variables in my data frame held JSON objects, many of the values were NA, and I did not want to drop the rows containing NAs. I wrote a function that takes a data frame, the name of an ID column in that data frame (usually a record ID), and the name of the variable to parse, both as strings.

The function creates two subsets, one for the records that contain JSON objects and another that keeps track of the records where the same variable is NA. It then joins those two data frames and inserts the result into the original data frame, replacing the former variable.

Perhaps it will help you or someone else, as it has worked for me in a few cases now. I haven't cleaned it up much, so I apologize if the variable names are confusing; this was a very ad hoc function I wrote for work. I should also state that I used another poster's idea for replacing the former variable with the new variables created from the JSON object. You can find that here: Add (insert) a column between two columns in a data.frame

One last note: there is a package called tidyjson that would have offered a simpler solution, but apparently it cannot work with list-type JSON objects. At least, that is my interpretation.

library(jsonlite)
library(stringr)
library(dplyr)

parse_var <- function(df, id, var) {
  m <- df[[var]]
  n <- df[[id]]
  # Logical indexing is used instead of -which(), which selects nothing when there are no NAs
  is_na <- is.na(m)

  # JSON strings and record IDs for the rows which are not NA
  p <- m[!is_na]
  key <- n[!is_na]

  # Create a data frame for the rows which are NA
  key_na <- n[is_na]
  q <- m[is_na]
  parse_df_na <- data.frame(key_na, q, stringsAsFactors = FALSE)

  # Parse the JSON values and bind them together into a data frame.
  p <- lapply(p, function(x) {
    fromJSON(x) %>% data.frame(stringsAsFactors = FALSE)}) %>% bind_rows()
  # Bind the record IDs of the JSON values to the parsed data frame and name the columns appropriately.
  parse_df <- data.frame(key, p, stringsAsFactors = FALSE)

  # The new variables begin with a capital 'X', so I replace that with my former variable's name
  n <- names(parse_df) %>% str_replace('^X', paste(var, ".", sep = ""))
  n <- n[2:length(n)]
  colnames(parse_df) <- c(id, n)

  # Join the data frame for NA JSON values and the data frame containing parsed JSON values
  parse_df <- merge(parse_df, parse_df_na, by.x = id, by.y = 'key_na', all = TRUE)

  # Remove the new column, q, formed by the NA values
  parse_df <- parse_df[, names(parse_df) != 'q', drop = FALSE]

  # merge() sorts by the id column, so restore the row order of the original data frame
  parse_df <- parse_df[match(df[[id]], parse_df[[id]]), , drop = FALSE]

  # Replace the variable being parsed in the data frame with the new parsed and named values.
  new_df <- data.frame(append(df, parse_df[, names(parse_df) != id, drop = FALSE], after = which(names(df) == var)), stringsAsFactors = FALSE)
  new_df <- new_df[, names(new_df) != var, drop = FALSE]
  return(new_df)
}
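
For comparison, the same idea (parse only the non-NA rows, then put NA rows back where the JSON was missing) can be sketched more compactly. This is an illustrative alternative, not the function above: the data frame, the column names `id` and `params`, and the helper names `ok`, `idx` and `out` are all made up, and it assumes flat JSON objects with scalar values.

```r
library(jsonlite)

# Hypothetical example data: the second record has no JSON
df <- data.frame(
  id = 1:3,
  params = c('{"a":0,"b":"x"}', NA, '{"a":3,"b":"y"}'),
  stringsAsFactors = FALSE
)

# Parse only the rows that actually contain JSON
ok <- !is.na(df$params)
parsed <- stream_in(textConnection(df$params[ok]), verbose = FALSE)

# Map every original row to its parsed row; rows without JSON keep an NA index,
# and indexing a data frame with NA yields a row filled with NA
idx <- rep(NA_integer_, nrow(df))
idx[ok] <- seq_len(sum(ok))
out <- parsed[idx, , drop = FALSE]
rownames(out) <- NULL

# Replace the JSON column with the parsed columns, keeping the original row order
result <- cbind(df["id"], out)
```

Because the rows are matched by position rather than joined on the ID, the original row order is preserved without any extra sorting step.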
Jdev