2

A colleague is sending me an Elasticsearch query result (100000's of records, hundreds of attributes) that looks like:

pets_json <- paste0('[{"animal":"cat","attributes":{"intelligence":"medium","noises":[{"noise":"meow","code":4},{"noise":"hiss","code":2}]}},',
                     '{"animal":"dog","attributes":{"intelligence":"high","noises":{"noise":"bark","code":1}}},',
                     '{"animal":"snake","attributes":{"intelligence":"low","noises":{"noise":"hiss","code":2}}}]')

There is a redundant key, code, that I do not need to capture.

I would like to produce a data.frame that looks something like:

animal  intelligence  noises.bark noises.hiss noises.meow   
   cat        medium            0           1           1 
   dog          high            1           0           0
 snake           low            0           1           0

I can read in the json, but flatten=TRUE doesn't completely flatten:

library(jsonlite)
str(df <- fromJSON(txt=pets_json, flatten=TRUE))
# 'data.frame': 3 obs. of  3 variables:
#   $ animal                 : chr  "cat" "dog" "snake"
# $ attributes.intelligence: chr  "medium" "high" "low"
# $ attributes.noises      :List of 3
# ..$ :'data.frame':    2 obs. of  2 variables: \
#   .. ..$ noise   : chr  "meow" "hiss"         \
# .. ..$ code: int  4 2                          |
# ..$ :List of 2                                 |
# .. ..$ noise   : chr "bark"                    |- need to remove code and flatten    
# .. ..$ code: int 1                             |
# ..$ :List of 2                                 |
# .. ..$ noise   : chr "hiss"                   /
# .. ..$ code: int 2                           /

Because the flattening is incomplete I can use this intermediate stage to get rid of the code unwanted keys before calling another flatten(), but the only way I know to get rid of the keys is really slow:

for( l in which(sapply(df, is.list)) ){
  for( l2 in which(sapply(df[[l]], is.list))){
    df[[l]][[l2]]['code'] <- NULL
  }
}
( df <- data.frame(flatten(df)) )
#   animal attributes.intelligence attributes.noises
# 1    cat                  medium        meow, hiss
# 2    dog                    high              bark
# 3  snake                     low              hiss

And then after that...? I know that using tidyr::separate I can probably come up with a hacky way to spread the noise values into columns and set flags. But that works for one attribute at a time, and I have possibly hundreds of these. I do not know all the possible attribute values in advance.

How can I efficiently produce the desired data.frame? Thanks for your time!

C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134

2 Answers2

2

I don't think there's a super simple way to get it in the right format, but here's an attempt:

out <- fromJSON(pets_json)

# drop the "code" data and do some initial cleaning
out$noises <- lapply(
  out$attributes$noises, 
  function(x) unlist(x[-match("code",names(x))]) 
)

# extract the key part of the intelligence variable
out$intelligence <- out$attributes$intelligence

# set up a vector of all possible noises
unq_noises <- unique(unlist(out$noises)) 

# make the new separate noise variables
out[unq_noises] <- t(vapply(
  out$noises, 
  function(x) unq_noises %in% x,
  FUN.VALUE=logical(length(out$noises)))
)

# clean up no longer needed variables
out[c("attributes","noises")] <- list(NULL)

out

#  animal intelligence  meow  hiss  bark
#1    cat       medium  TRUE  TRUE FALSE
#2    dog         high FALSE FALSE  TRUE
#3  snake          low FALSE  TRUE FALSE
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • Thanks much! Let me see if I can generalize this to a large number of attributes (noise being one). If it works, I will accept -- but I might not get to it until Monday. – C8H10N4O2 Oct 24 '15 at 05:00
2

Basic case with magrittr and data.table

Here's another proposal combining magrittr and data.table for extra zeitgeist brownie points:

# Do not simplify to data.frame
str(df <- fromJSON(txt=pets_json, simplifyDataFrame=F))

# The %<>% operator create a pipe and assigns back to the variable
df %<>% 
  lapply(. %>%
    data.table(animal = .$animal, 
               intelligence = .$attributes$intelligence, 
               noises = unlist(.$attributes$noises)) %>% # Create a data.table
    .[!noises %in% as.character(0:9)] ) %>% # Remove numeric values
  rbindlist %>% # Combine into a single data.table
  dcast(animal + intelligence ~ paste0("noises.", noises), # Cast the noises variables
        value.var = "noises", 
        fill = 0, # Put 0 instead of NA
        fun.aggregate = function(x) 1) # Put 1 instead of noise

The end format matches your request:

df
#    animal intelligence noises.bark noises.hiss noises.meow
# 1:    cat       medium           0           1           1
# 2:    dog         high           1           0           0
# 3:  snake          low           0           1           0

For multiple attributes

Now, you seem to want a generalization to multiple attributes. Suppose your data also has a colors attribute, e.g.:

pets_json <- paste0('[{"animal":"cat","attributes":{"intelligence":"medium","noises":[{"noise":"meow","code":4},{"noise":"hiss","code":2}],"colors":[{"color":"brown","code":4},{"color":"white","code":2}]}},',
                    '{"animal":"dog","attributes":{"intelligence":"high","noises":{"noise":"bark","code":1},"colors":{"color":"brown","code":4}}},',
                    '{"animal":"snake","attributes":{"intelligence":"low","noises":{"noise":"hiss","code":2},"colors":[{"color":"green","code":4},{"color":"brown","code":4}]}}]')

Then you can follow this generalized code, which is rather ugly but should work fine:

# Do not simplify to data.frame
str(df <- fromJSON(txt=pets_json, simplifyDataFrame=F))

# Set up the attributes names
attr.names <- c("noises", "colors")

# The %<>% operator create a pipe and assigns back to the variable
df %<>% 
  lapply(function(.)
    eval(parse(text=paste0(
      "data.table(animal = .$animal, ",
      "intelligence = .$attributes$intelligence, ", 
      paste0(attr.names, " = unlist(.$attributes$", attr.names, ")", collapse=", "), 
      ")")))
    %>%
      .[eval(parse(text=paste("!", attr.names, "%in% as.character(0:9)", collapse = " & ")))] ) %>%
  rbindlist 

# Cast each variable and merge together
df <- dcast(melt(df, measure.vars=c(attr.names)), 
        animal + intelligence ~ variable + value, sep=".")

#    animal intelligence noises.bark noises.hiss noises.meow colors.brown
# 1:    cat       medium           0           1           1            1
# 2:    dog         high           1           0           0            1
# 3:  snake          low           0           1           0            1
#    colors.green colors.white
# 1:            0            1
# 2:            0            0
# 3:            1            0

This solution should also work for a single attribute, e.g. attr.names <- c("noises").

cocquemas
  • 1,149
  • 8
  • 17
  • This is really outstanding, you are clearly a rising star on this tag. I notice that `?dcast` `can now cast multiple value.var columns`. I am looking into this and to see if there is a way to keep the variable names as `noises.bark`, `color.brown`, etc. – C8H10N4O2 Oct 26 '15 at 21:25
  • @C8H10N4O2 I fixed the second part, I had overlooked that bit, sorry. As for using multiple dcast attributes, that was my first idea but I don't think it's meant for this kind of problem. You can try for yourself: `dcast(df, as.formula(paste0("animal + intelligence ~ paste0(attr.names, '.', ", attr.names, ")")), value.var = attr.names, fill = 0, fun.aggregate = function(x) 1)` gives really strange results. – cocquemas Oct 27 '15 at 03:51
  • This has been a great help. Thank you. – C8H10N4O2 Oct 27 '15 at 21:16
  • 1
    @C8H10N4O2 Just thought of a much nicer solution for the last bit, using `melt` and `dcast` again. So much more elegant! – cocquemas Oct 28 '15 at 02:43