
I'm pulling soccer data through an API - the resulting JSON is returned as a list; dput example below:

list(list(id = 10332894L, league_id = 8L, season_id = 12962L, 
aggregate_id = NULL, venue_id = 201L, localteam_id = 51L, 
visitorteam_id = 27L, weather_report = list(code = "drizzle", 
    temperature = list(temp = 53.92, unit = "fahrenheit"), 
    clouds = "90%", humidity = "87%", wind = list(speed = "12.75 m/s", 
        degree = 200L)), attendance = 25098L, leg = "1/1", 
deleted = FALSE, referee = list(data = list(id = 15267L, 
    common_name = "L. Probert", fullname = "Lee Probert", 
    firstname = "Lee", lastname = "Probert"))), list(id = 10332895L, 
league_id = 8L, season_id = 12962L, aggregate_id = NULL, 
venue_id = 340L, localteam_id = 251L, visitorteam_id = 78L, 
weather_report = list(code = "drizzle", temperature = list(
    temp = 50.07, unit = "fahrenheit"), clouds = "90%", humidity = "93%", 
    wind = list(speed = "6.93 m/s", degree = 160L)), attendance = 22973L, 
leg = "1/1", deleted = FALSE, referee = list(data = list(
    id = 15273L, common_name = "M. Oliver", fullname = "Michael Oliver", 
    firstname = "Michael", lastname = "Oliver"))))

I'm extracting with a for loop at the moment. The reprex shows 2 top-level list items, but there are hundreds in the full data. The main drawback of the loop is that missing values sometimes cause it to stop. I'd like to move this to purrr but am struggling to extract second-level nested items using `at_depth` or `modify_depth`. There are also nests inside nests, which really adds to the complexity.

The end-state should be a tidy data frame - from this data the df will only have 2 rows but will have many columns each representing an item, no matter where that item is nested in this list. If something's missing then it should be an NA value.

Ideally, even if it's inelegant, the solution would produce one data frame per level / nested item, which can then be bound together later.

thanks.

nycrefugee

1 Answer


Step 1: Replace NULL with NA using the simple_rapply function from a community wiki answer:

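# Apply fn to every leaf of a nested list, keeping the list structure.
simple_rapply <- function(x, fn) {
  if (is.list(x)) {
    lapply(x, simple_rapply, fn)
  } else {
    fn(x)
  }
}

# `l` is the list from the question; NULL leaves become NA.
non.null.l <- simple_rapply(l, function(x) if (is.null(x)) NA else x)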
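
As the answerer notes in the comments, this step exists because `unlist` drops `NULL` entries. A minimal base-R sketch of that effect follows; `rec` is a made-up toy record (not the real API data), and `simple_rapply` is repeated only so the snippet runs on its own:

```r
# Toy record modelled on the question's data; `rec` is a hypothetical name.
rec <- list(
  id = 1L,
  aggregate_id = NULL,
  referee = list(data = list(id = 15267L))
)

# unlist() silently drops NULL leaves, so aggregate_id disappears here.
names(unlist(rec))

# Same helper as in Step 1, repeated so this snippet is self-contained.
simple_rapply <- function(x, fn) {
  if (is.list(x)) lapply(x, simple_rapply, fn) else fn(x)
}

# After swapping NULL for NA, the aggregate_id name survives unlist().
rec_na <- simple_rapply(rec, function(x) if (is.null(x)) NA else x)
names(unlist(rec_na))
```

Note that nested names are joined with a dot, so `referee$data$id` ends up as `referee.data.id`.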

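Step 2: Flatten each record with unlist() and bind the rows into one data frame:

library(purrr)
library(dplyr)  # bind_rows() comes from dplyr

map_df(map(non.null.l, unlist), bind_rows)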
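
For comparison, the same flatten-and-bind idea can be sketched in base R. This is not the answerer's method, only an illustration under stated assumptions: `recs` is a hypothetical two-record list shaped like the question's data, and the second record deliberately lacks `attendance` to show how a missing item becomes NA. Because each record mixes strings and numbers, `unlist` coerces every value to character, so the last line re-guesses the column types with `type.convert`.

```r
# Hypothetical records; NULLs are assumed to have been replaced by NA already.
recs <- list(
  list(id = 1L, aggregate_id = NA,
       weather_report = list(code = "drizzle",
                             temperature = list(temp = 53.92, unit = "fahrenheit")),
       attendance = 25098L),
  list(id = 2L, aggregate_id = NA,
       weather_report = list(code = "clear",
                             temperature = list(temp = 50.07, unit = "fahrenheit")))
)

# unlist() flattens every nesting level and joins nested names with ".".
flat <- lapply(recs, unlist)

# Every column name seen in any record, in first-seen order.
cols <- unique(unlist(lapply(flat, names)))

# Index each record by the full column set; names it lacks come back as NA.
rows <- lapply(flat, function(v) setNames(v[cols], cols))

df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)

# All columns are character at this point; convert them back where possible.
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
```

The result has one row per record and one column per leaf, wherever that leaf was nested, for example `weather_report.temperature.temp`.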
A. Suliman
  • Thanks - that did a really good job, but some of the items in the nested sub-lists weren't extracted. If you look at the sub-list `my_list[[1]][["weather_report"]]`, a column has been created for `weather`, but its value comes from `my_list[[1]][["weather_report"]][["code"]]`; the further sub-lists aren't unpacked, nor are the items at the same level as `code` included. Any ideas? Thanks again. – nycrefugee Jan 02 '19 at 19:26
  • @nycrefugee It's my pleasure to help. I think foldel's procedure was designed for a two-level list, not a multilevel one, and that is what caused the issue in the final df. Please check whether anything still goes wrong. Finally, we take this path because `aggregate_id` is `NULL` and `unlist` will drop it; if you don't need it, you can do step 2 directly. Thanks – A. Suliman Jan 02 '19 at 20:23