0

I'm trying to wrangle a json file into a data frame but I'm having trouble when unnesting to a wider format because the columns contain lists of unequal sizes. In addition, it appears that one column of lists contains what should be the name of the new column and the other column contains the data.

I apologize for not including a repex but I don't know how to create a data frame with lists. I've included a screen shot to hopefully highlight the issue.

nested data frame

When I attempt to unnest with raw_json2 %>% unnest_wider(value) I get the following error: Error: Can't combine `..1$...1` <character> and `..27$...1` <list>.

It appears that the column type contains what should be the column names, and that 'value' contains the values for those columns. So essentially columns would be added to my data frame from type and data added from value.

EDIT------------------------------ Sample data. I put the first 30 as there some tricky lists at the end.

> dput(head(raw_json2, n=30))

structure(list(id = c(112357710L, 112357713L, 112357714L, 112357717L, 
112357719L, 112357723L, 112357727L, 112357730L, 112357732L, 112357736L, 
112357737L, 112357738L, 112357744L, 112357745L, 112357746L, 112357747L, 
112357759L, 112357760L, 112357761L, 112357764L, 112357765L, 112357766L, 
112357767L, 112357775L, 112357777L, 112357780L, 112357782L, 112357783L, 
112357784L, 112357791L), start_sec = c("00:00:19.000000", "00:45:34.000000", 
"00:02:19.000000", "00:00:49.000000", "00:00:46.000000", "00:00:59.000000", 
"00:01:17.000000", "00:01:29.000000", "00:01:43.000000", "00:02:02.000000", 
"00:02:04.000000", "00:02:06.000000", "00:02:14.000000", "00:02:36.000000", 
"00:02:22.000000", "00:02:46.000000", "00:02:52.000000", "00:02:48.000000", 
"00:03:00.000000", "00:03:02.000000", "00:03:10.000000", "00:03:12.000000", 
"00:04:01.000000", "00:03:27.000000", "00:04:15.000000", "00:03:53.000000", 
"00:04:00.000000", "00:04:03.000000", "00:05:05.000000", "00:05:06.000000"
), end_sec = c("00:00:19.000000", "00:45:34.000000", "00:02:19.000000", 
"00:00:49.000000", "00:00:46.000000", "00:00:59.000000", "00:01:17.000000", 
"00:01:29.000000", "00:01:43.000000", "00:02:02.000000", "00:02:04.000000", 
"00:02:06.000000", "00:02:14.000000", "00:02:36.000000", "00:02:22.000000", 
"00:02:46.000000", "00:02:52.000000", "00:02:48.000000", "00:03:00.000000", 
"00:03:02.000000", "00:03:10.000000", "00:03:12.000000", "00:04:01.000000", 
"00:03:27.000000", "00:04:15.000000", "00:03:53.000000", "00:04:00.000000", 
"00:04:03.000000", "00:05:05.000000", "00:05:06.000000"), type = list(
    c("teamNames", "list"), "list", "teamNames", "teamNames", 
    c("teamNames", "list"), "teamNames", "teamNames", "teamNames", 
    "teamNames", c("teamNames", "list"), "teamNames", "teamNames", 
    "teamNames", "teamNames", "teamNames", "teamNames", "teamNames", 
    "teamNames", "teamNames", "teamNames", "teamNames", "teamNames", 
    "teamNames", c("teamNames", "list"), "teamNames", c("teamNames", 
    "list"), c("teamNames", "list", "chartPoint", "chartPoint"
    ), "teamNames", "teamNames", "teamNames"), value = list(c("Real Madrid", 
"kickoff"), "1", "Real Madrid", "Real Madrid", c("Real Madrid", 
"throw in"), "Real Madrid", "Barcelona", "Real Madrid", "Barcelona", 
    c("Real Madrid", "free kick"), "Real Madrid", "Real Madrid", 
    "Real Madrid", "Real Madrid", "Barcelona", "Barcelona", "Barcelona", 
    "Real Madrid", "Real Madrid", "Barcelona", "Barcelona", "Real Madrid", 
    "Real Madrid", c("Real Madrid", "throw in"), "Real Madrid", 
    c("Real Madrid", "corner kick"), list("Real Madrid", "save", 
        list(x = 483.51837158203, y = 397.89303588867, x2 = 0L, 
            y2 = 0L, type = "point", sector = 1L, orientation = "left"), 
        list(x = 274.94967651367, y = 404.6828918457, x2 = 0L, 
            y2 = 0L, type = "point", sector = 3L, orientation = "left")), 
    "Barcelona", "Barcelona", "Barcelona")), row.names = c(NA, 
-30L), class = c("tbl_df", "tbl", "data.frame"))
seansteele
  • 619
  • 3
  • 10
  • Can you provide a sample of your data via `dput(head(raw_json2))`? – AndrewGB Apr 29 '22 at 04:04
  • 1
    Do you just want to unnest the first level, i.e. get 4 colums for the teamname, the second column (list) and then Chartpoint, chartpoint. Or do you want to completely unnest to the bottom, i.e. in row 27, unnest the inner list elements as well? – deschen Apr 29 '22 at 07:17
  • @seansteele, any news from you in terms of the question asked above in the comments or the provided solution? – deschen May 02 '22 at 17:56
  • @deschen ultimately I would like to unnest the inner list elements as well. I used your answer to unnest the new value_4 and it worked on my initial dataset... but when I tried it with a new .json file I'm receiving errors. Trying to pinpoint what is different between the initial .json file and the new .json file. – seansteele May 09 '22 at 17:32
  • Can you provide an example of what your desired output should look like? If you want to entirely unnest all list elements, an approach with rrapply might do the trick. – deschen May 09 '22 at 17:37
  • At this point just looking for an entire unnest and then I can wrangle the appropriate columns from there. – seansteele May 09 '22 at 19:25

1 Answers1

0

Here's one way. The tricky part indeed is that a) your value column has list elements of different length and b) the names stored in type also differ per row, soe we can't use them for naming the new unnested columns (hence I decided for a generic naming):

df |>
  mutate(value2 = map2(value, max(lengths(type)), ~ `length<-`(.x, .y)),
         value2 = map(value2, as.list),
         value2 = map(value2, setNames, paste0("value_", 1:4))) |> 
  unnest_wider(value2)

What we do here is:

  • First we need to make sure that all your list elements have the same length, which is the maximum of all possible lenghts (in your case 4). Inspiration is taken from (how else could it be) akrun's solution here: https://stackoverflow.com/a/57152605/2725773 (give him credit for it!)
  • We then need to convert each row of value2 into a list, because they are of different type (e.g. row 27 again has lists, where most other columns don't).
  • We then give each of the 4 list columns a name.
  • Then we can unnest_wider

The result (showing rows 1:5 and 26:30):

# A tibble: 10 × 9
          id start_sec       end_sec         type      value      value_1     value_2     value_3          value_4         
       <int> <chr>           <chr>           <list>    <list>     <chr>       <chr>       <list>           <list>          
 1 112357710 00:00:19.000000 00:00:19.000000 <chr [2]> <chr [2]>  Real Madrid kickoff     <chr [1]>        <chr [1]>       
 2 112357713 00:45:34.000000 00:45:34.000000 <chr [1]> <chr [1]>  1           NA          <chr [1]>        <chr [1]>       
 3 112357714 00:02:19.000000 00:02:19.000000 <chr [1]> <chr [1]>  Real Madrid NA          <chr [1]>        <chr [1]>       
 4 112357717 00:00:49.000000 00:00:49.000000 <chr [1]> <chr [1]>  Real Madrid NA          <chr [1]>        <chr [1]>       
 5 112357719 00:00:46.000000 00:00:46.000000 <chr [2]> <chr [2]>  Real Madrid throw in    <chr [1]>        <chr [1]>       
 6 112357780 00:03:53.000000 00:03:53.000000 <chr [2]> <chr [2]>  Real Madrid corner kick <chr [1]>        <chr [1]>       
 7 112357782 00:04:00.000000 00:04:00.000000 <chr [4]> <list [4]> Real Madrid save        <named list [7]> <named list [7]>
 8 112357783 00:04:03.000000 00:04:03.000000 <chr [1]> <chr [1]>  Barcelona   NA          <chr [1]>        <chr [1]>       
 9 112357784 00:05:05.000000 00:05:05.000000 <chr [1]> <chr [1]>  Barcelona   NA          <chr [1]>        <chr [1]>       
10 112357791 00:05:06.000000 00:05:06.000000 <chr [1]> <chr [1]>  Barcelona   NA          <chr [1]>        <chr [1]>  
deschen
  • 10,012
  • 3
  • 27
  • 50
  • Thanks. This answered my original question so I've marked as solved. I am working on a new question to address several lists of unequal sizes. – seansteele May 10 '22 at 00:07