Flatten nested list of lists with variable numbers of elements to a data frame

Question

I've got a nested list of lists that I'd like to flatten into a dataframe with id variables so I know which list elements (and sub-list elements) each came from.

> str(gc_all)
List of 3
$ 1: num [1:102, 1:2] -74 -73.5 -73 -72.5 -71.9 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:2] "lon" "lat"
$ 2: num [1:102, 1:2] -74 -73.3 -72.5 -71.8 -71 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:2] "lon" "lat"
$ 3:List of 2
..$ : num [1:37, 1:2] -74 -74.4 -74.8 -75.3 -75.8 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:2] "lon" "lat"
..$ : num [1:65, 1:2] 180 169 163 158 154 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:2] "lon" "lat"

I've used plyr::ldply(mylist, rbind) for flattening lists before, but I seem to be encountering trouble due to variable list lengths: some list elements contain only one dataframe, whilst others contain a list of two dataframes.

I've found a clunky solution using two lapplys and an ifelse like so:

# sample latitude-longitude data
df <- data.frame(source_lat = rep(40.7128, 3),
                 source_lon = rep(-74.0059, 3),
                 dest_lat = c(55.7982, 41.0082, -7.2575),
                 dest_lon = c(37.968, 28.9784, 112.7521),
                 id = 1:3)

# split into list
gc_list <- split(df, df$id)

# get great circles between lat-lon for each id; multiple list elements are outputted when the great circle crosses the dateline
gc_all <- lapply(gc_list, function(x) {
  geosphere::gcIntermediate(x[, c("source_lon", "source_lat")],
                 x[, c("dest_lon", "dest_lat")],
                 n = 100, addStartEnd=TRUE, breakAtDateLine=TRUE)
})

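library(magrittr)  # provides %>%

gc_fortified <- lapply(1:length(gc_all), function(i) {
  # is.list() always returns a single TRUE/FALSE; class(x) == "list" has
  # length 2 for a matrix in R >= 4.0.0 and breaks the if()
  if (is.list(gc_all[[i]])) {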
    lapply(1:length(gc_all[[i]]), function(j) {
      data.frame(gc_all[[i]][[j]], id = i, section = j)
    }) %>%
      plyr::rbind.fill()
  } else {
    data.frame(gc_all[[i]], id = i, section = 1)
  }
}) %>%
  plyr::rbind.fill()

But I feel like there must be a more elegant solution that works as a one-liner, e.g. dput, data.table?

Here's what I expect the output to look like:

> gc_fortified %>% 
    group_by(id, section) %>%
    slice(1)

       lon      lat    id section
     <dbl>    <dbl> <int>   <dbl>
1 -74.0059 40.71280     1       1
2 -74.0059 40.71280     2       1
3 -74.0059 40.71280     3       1
4 180.0000 79.70115     3       2

how about `do.call("rbind.fill", lapply(gc_all, rbind.fill))` ? Assuming your list runs just two levels deep. — RolandASc, Jan 31 '18 at 14:03
Where is your sample data—that can be used for testing? What is your expected output? — 989, Jan 31 '18 at 14:30
@RolandASc I've tried this but it returns the error `arguments imply differing number of rows` — jogall, Jan 31 '18 at 14:37
@989 sample data is already included in the question. `gc_fortified` contains the expected output, but I have added a sample of it to the question anyway. — jogall, Jan 31 '18 at 14:39
you are right, I didn't realize you had matrices. it would have to be `do.call("rbind.fill.matrix", lapply(gc_all, rbind.fill.matrix))` then — RolandASc, Jan 31 '18 at 15:45
`do.call(plyr::rbind.fill.matrix, lapply(gc_all, plyr::rbind.fill.matrix))` seems to work but you're not keeping the item ids. — moodymudskipper, Jan 31 '18 at 15:57

G. Grothendieck · Answer 1 · 2018-01-31T16:19:50.027

I think I prefer the recursive solution already shown but this is one statement of the form do.call("rbind", ...) as requested, if you substitute L and add_n_s into the last line. I have kept them separate here only for clarity.

I have left the result as a matrix since it is entirely numeric. I suspect you were using data frames only because rbind.fill works on them, not because you prefer them. Replace cbind in the add_n_s function with data.frame if you prefer a data frame result.

No packages are used and the solution does not use any indexing.

Here gc_all is transformed to L which is the same except that it is a list of lists and not a list of a mix of matrices and lists. add_n_s takes an element of L and adds n and s columns to it. Finally we Map add_n_s across L and flatten.

Note that if the input had been a list of lists in the first place then L would equal gc_all and the first line would not have been needed.

L <- lapply(gc_all, function(x) if (is.list(x)) x else list(x))

add_n_s <- function(x, n) Map(cbind, x, n = n, s = seq_along(x))
do.call("rbind", do.call("c", Map(add_n_s, L, seq_along(gc_all))))

Update fixed.
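
For readers without geosphere installed, the three steps can be traced on a toy input. This is only a sketch: `toy` and the helper `m` are made-up stand-ins for gc_all, and data.frame is swapped in for cbind to get the data frame variant mentioned above.

```r
# Toy stand-in for gc_all (assumed data, not real great-circle output):
# elements "1" and "2" are matrices, element "3" is a list of two matrices.
m <- function(n, start) cbind(lon = start + seq_len(n), lat = seq_len(n))
toy <- list(`1` = m(3, 0), `2` = m(2, 10), `3` = list(m(2, 20), m(4, 30)))

# Step 1: wrap bare matrices so every element is a list of matrices.
L <- lapply(toy, function(x) if (is.list(x)) x else list(x))

# Step 2: attach the outer index (n) and the within-element index (s).
# data.frame replaces cbind here, so each piece becomes a data frame.
add_n_s <- function(x, n) Map(data.frame, x, n = n, s = seq_along(x))

# Step 3: flatten one level with c(), then stack the pieces.
res <- do.call("rbind", do.call("c", Map(add_n_s, L, seq_along(L))))
rownames(res) <- NULL  # drop the row names derived from the list names

# Distinct (n, s) pairs: one per original matrix.
unique(res[, c("n", "s")])
```

The result has one row per input row, and the n and s columns identify the top-level element and the piece within it.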

Thanks for the answer, I've accepted the `purrr` solution as it satisfies my heretical tidyverse predilection but this is a really nice base solution! — jogall, Jan 31 '18 at 16:26

score 2 · Answer 2 · answered Jan 31 '18 at 14:03


I can't offer a one-liner, but you could consider recursion here too:

flat <- function(l, s = NULL) {
  lapply(1:length(l), function(i) {
    if (is.list(l[[i]])) {
      do.call(rbind, flat(l[[i]], i))
    } else {
      cbind(l[[i]], id = if (is.null(s)) i else s, section = if (is.null(s)) 1 else i)
    }
  })
}

a <- do.call(rbind, flat(gc_all))
all.equal(data.frame(a), gc_fortified)

[1] TRUE
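
To see how the s argument threads the parent index into the nested call, here is the same recursion run on a toy input. This is a sketch only: `toy` and `m` are made-up stand-ins for gc_all, and seq_along replaces 1:length. As written, the function assumes at most two levels of nesting, as in gc_all; with deeper nesting the outermost index would be lost.

```r
# Toy stand-in for gc_all (assumed data): two matrices and one list of two.
m <- function(n) cbind(lon = seq_len(n), lat = seq_len(n))
toy <- list(m(2), m(3), list(m(1), m(2)))

flat <- function(l, s = NULL) {
  lapply(seq_along(l), function(i) {
    if (is.list(l[[i]])) {
      # nested list: recurse, handing the current index down as the id
      do.call(rbind, flat(l[[i]], i))
    } else {
      # matrix: top level gives id = i, section = 1;
      # inside a nested call id = s (parent index), section = i
      cbind(l[[i]],
            id = if (is.null(s)) i else s,
            section = if (is.null(s)) 1 else i)
    }
  })
}

out <- do.call(rbind, flat(toy))

# Distinct (id, section) pairs: one per original matrix.
unique(out[, c("id", "section")])
```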

erocoar


Thanks that's a bit tighter. I'm still hoping for a magical `do.call("rbind.fill", ...)` type of one-liner though! – jogall Jan 31 '18 at 14:43

moodymudskipper · Accepted Answer · 2018-01-31T15:47:48.143


First the structure of the list needs to be reworked so it becomes a regular list of lists, then we apply map_dfr two times, using the .id parameter.

library(purrr)
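# wrap bare matrices in a list, then convert every matrix to a data frame;
# is.matrix() is used as the predicate because it returns a single TRUE/FALSE
gc_all_df <- map(map_if(gc_all, is.matrix, list), ~ map(.x, as.data.frame))
# bind the inner lists (id2 = section), then the outer list (id1 = id);
# map_dfr() needs dplyr to be installed
map_dfr(gc_all_df, ~ map_dfr(.x, identity, .id = "id2"), .id = "id1")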
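
The first step has to decide, element by element, whether it is looking at a matrix or a list, and that test must return a single TRUE or FALSE. Since R 4.0.0 a matrix has class c("matrix", "array"), so comparing class(x) with a string gives a length-two result there. A small base R sketch of checks that stay single-valued (the object `x` is just an illustrative placeholder):

```r
x <- matrix(0, nrow = 2, ncol = 2)

# Length of the comparison result: 2 in R >= 4.0.0, 1 in older versions
length(class(x) == "matrix")

# Each of these returns exactly one logical value in any R version
is.matrix(x)           # TRUE
is.list(x)             # FALSE
inherits(x, "matrix")  # TRUE
is.list(list(x))       # TRUE
```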

edited Jan 31 '18 at 15:47

answered Jan 31 '18 at 15:41


Spot on, thanks! I've been meaning to learn `purrrrrrrr` for a while now and this seals the deal – jogall Jan 31 '18 at 16:20
One thing I noticed, it's better to call the function directly without loading the package (i.e. `purrr::map_dfr`) due to a conflict with `ggplot2` (apparently this is quite a common thing) – jogall Jan 31 '18 at 16:22
It's always safer but for tidyverse functions it's quite convenient to load the package... for more fun with purrr you can check my initial answer (see edit history) where I used `purrr::partial` and `purrr::lift_dl` on `dplyr::bind_rows`. These are really cool functions to combine with `map` calls – moodymudskipper Jan 31 '18 at 22:26
