I need to efficiently "unroll" a list column in an R data.frame. For example, if I have a data.frame defined as:

dbt <- data.frame(values=c(1,1,1,1,2,3,4), 
                  parm1=c("A","B","C","A","B","C","B"),
                  parm2=c("d","d","a","b","c","a","a"))

Then assume an analysis that generates one column as a list, similar to the following:

agg <- aggregate(values ~ parm1 + parm2, data=dbt, 
                 FUN=function(x) {return(list(x))})

The aggregated data.frame looks like this (where class(agg$values) == "list"):

  parm1 parm2 values
1     B     a      4
2     C     a   1, 3
3     A     b      1
4     B     c      2
5     A     d      1
6     B     d      1

I'd like to unroll the "values" column, repeating the parm1 and parm2 values (adding more rows) for each element of each list, efficiently over all the data.frame rows.
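For this toy example, the fully unrolled result would just recover the seven original rows of dbt, regrouped:

  parm1 parm2 values
1     B     a      4
2     C     a      1
3     C     a      3
4     A     b      1
5     B     c      2
6     A     d      1
7     B     d      1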

At the top level I wrote a function that does the unroll in a for loop, called via apply. It's really inefficient: the aggregated data.frame takes about an hour to create and nearly 24 hours to unroll (the fully unrolled data has ~500k records). The top-level call I'm using is:

unrolled.data <- do.call(rbind, apply(agg, 1, FUN=unroll.data))

The function just calls unlist() on the row's values element, then builds the returned data.frame row by row in a for loop. Schematically it looks like the sketch below (simplified, not my exact code):
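# Hypothetical reconstruction of the slow unroll.data; the real function
# isn't shown above, so the names and details here are assumptions.
unroll.data <- function(row) {
    vals <- unlist(row[["values"]])   # flatten this row's list entry
    out <- NULL
    for (v in vals) {
        # rbind() inside a loop copies the growing data.frame on every
        # iteration, which is why this scales so badly
        out <- rbind(out, data.frame(parm1 = row[["parm1"]],
                                     parm2 = row[["parm2"]],
                                     values = v))
    }
    out
}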

The environment is somewhat restricted: the tidyr, data.table, and splitstackshape libraries are unavailable to me, and I'm limited not only to functions found in base:: but to those available in v3.1.1 and earlier. Thus the answers in this (not really a duplicate) question do not apply.

Any suggestions on something faster?

Thanks!

  • Thanks, but the tidyr library is unavailable in this circumstance. Other thoughts? – TimH Jul 22 '15 at 00:31
  • Unfortunately, neither splitstackshape nor data.table is available either (thanks for the referenced issue, I hadn't seen it). – TimH Jul 22 '15 at 00:34

1 Answer

With base R, you could try

with(agg, {
    data.frame(
        # repeat each parm value once per element of its list entry
        lapply(agg[,1:2], rep, times=lengths(values)),
        # flatten the list column into an atomic vector
        values=unlist(values)
    )
})
#      parm1 parm2 values
# 1.2      B     a      4
# 1.31     C     a      1
# 1.32     C     a      3
# 2.1      A     b      1
# 3.2      B     c      2
# 4.1      A     d      1
# 4.2      B     d      1

Timings for an alternative (thanks @thelatemail; dplyr is used here only to build a larger test set):

library(dplyr)
agg %>%
  sample_n(1e7, replace=T) -> bigger

system.time(
    with(bigger, { data.frame(lapply(bigger[,1:2], rep, times=lengths(values)), values=unlist(values)) })
)
# user  system elapsed 
# 3.78    0.14    3.93 

system.time(
    with(bigger, { data.frame(bigger[rep(rownames(bigger), lengths(values)), 1:2], values=unlist(values)) })
)
# user  system elapsed 
# 11.30    0.34   11.64 
  • You could replace the middle `lapply` bit with a row indexing call, `agg[rep(rownames(agg), lengths(values)), 1:2]`, which should be quicker. – thelatemail Jul 22 '15 at 00:54
  • @thelatemail not sure why, but it actually seems to be a bit slower. Would it be applying the `[` function many times? – Rorschach Jul 22 '15 at 01:19
  • It shouldn't be, the `[` call should only happen once, as should the `rep` call. Maybe the subsetting is just slower than multiple calls to each column. Seems odd to me. – thelatemail Jul 22 '15 at 01:22
  • Of course, the overhead of `[` will be proportionally less if there are many more than 2 variables to loop over with `lapply`. For this example, what you had originally is probably best. – thelatemail Jul 22 '15 at 04:11
  • This looks like it could work, but I'm having a couple of issues. The function `lengths` isn't in my build, and if I use `length` instead there's a mismatch in row count: "arguments imply differing number of rows: 36, 7". – TimH Jul 22 '15 at 22:08
  • @TimH `lengths` was introduced in R 3.2+; the old way would be `sapply(values, length)`. – Rorschach Jul 22 '15 at 22:13
  • That explains it, I'm forced to stay with 3.1.1 for now. The sapply variant did the trick, takes less than 7 seconds to handle the complete object. – TimH Jul 23 '15 at 02:07
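
Putting the thread together, a version of the accepted approach that should run on R 3.1.1 (substituting `sapply(values, length)` for `lengths()`, as noted in the comments above):

with(agg, {
    data.frame(
        # sapply(values, length) is the pre-3.2 equivalent of lengths(values)
        lapply(agg[,1:2], rep, times=sapply(values, length)),
        values=unlist(values)
    )
})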