2

I have a dataframe that has one column which contains json data. I want to extract some attributes from this json data into named columns of the data frame.

Sample data

json_col = c('{"name":"john"}','{"name":"doe","points": 10}', '{"name":"jane", "points": 20}')
id = c(1,2,3)
df <- data.frame(id, json_col)

I was able to achieve this using

library(tidyverse)
library(jsonlite)

extract_json_attr <- function(from, attr, default=NA) {
  value <- from %>% 
             as.character() %>% 
             jsonlite::fromJSON(txt = .) %>%
             .[attr]

  return(ifelse(is.null(value[[1]]), default, value[[1]]))
}

df <- df %>% 
        rowwise() %>%
        mutate(name = extract_json_attr(json_col, "name"),
               points = extract_json_attr(json_col, "points", 0))

In this case the extract_json_attr needs to parse the json column multiple times for each attribute to be extracted.

Is there a better way to extract all attributes at one shot?

I tried this function to return multiple values as a list, but I am not able to use it with mutate to set multiple columns.

extract_multiple <- function(from, attributes){
  values <- from %>% 
             as.character() %>% 
             jsonlite::fromJSON(txt = .) %>%
             .[attributes]
  return (values)
} 

I am able to extract the desired values using this function

extract_multiple(df$json_col[1],c('name','points'))
extract_multiple(df$json_col[2],c('name','points')) 

But cannot apply this to set multiple columns in a single go. Is there a better way to do this efficiently?

Jerry Thomas
  • 305
  • 3
  • 7

2 Answers2

2

Here is one way using bind_rows from dplyr

dplyr::bind_rows(lapply(as.character(df$json_col), jsonlite::fromJSON))

# A tibble: 3 x 2
#  name  points
#  <chr>  <int>
#1 john      NA
#2 doe       10
#3 jane      20

To subset specific attribute from the function, we can do

bind_rows(lapply(as.character(df$json_col), function(x) 
          jsonlite::fromJSON(x)[c('name', 'points')]))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Wow! this is a simple solution. Thanks. Can we select specific attributes to extract? My actual dataset is huge and I want to extract only a subset of the attributes in the json. – Jerry Thomas Dec 02 '19 at 06:06
  • @JerryThomas You can subset after doing `jsonlite::fromJSON` as well but wouldn't selecting the columns which you need with `select` more simple ? Like `bind_rows(lapply(as.character(df$json_col), jsonlite::fromJSON)) %>% select(name)` – Ronak Shah Dec 02 '19 at 06:08
  • One of the problems I faced with this approach was that different rows had attributes with the same name but different type. For example, `points` was an array in some rows and for other rows, it was a numeric value. When this process is run it throws an error. I would like the ability to filter out offending data attributes before the conversion instead of after. It should also reduce the time taken for conversion. – Jerry Thomas Dec 02 '19 at 09:53
  • @JerryThomas I see. Okay I have updated the answer to subset specific attributes. – Ronak Shah Dec 02 '19 at 10:09
0

On the R4DS slack channel I received an alternative approach for handling json arrays as columns. Using that, I found another approach that seems to work better on larger datasets.

library(tidyverse)
library(jsonlite)

extract <- function(input, fields){
    json_df <- fromJSON(txt=input)
    missing <- setdiff(fields, names(json_df))
    json_df[missing] <- NA

    return (json_df %>% select(fields))
}


df <- data.frame(id=c(1,2,3),
                 json_col=c('{"name":"john"}','{"name":"doe","points": 10}', '{"name":"jane", "points": 20}'),
                 stringsAsFactors=FALSE)
df %>%
  mutate(json_col = paste0('[',json_col,']'),
         json_col = map(json_col, function(x) extract(input=x, fields=c('name', 'points')))) %>%
  unnest(cols=c(json_col))
Jerry Thomas
  • 305
  • 3
  • 7