19

I have a number of operations on data frames that I would like to speed up using mclapply() or other lapply()-like functions. One of the easiest ways for me to approach this is to turn each row of the data frame into a small data frame stored in a list. I can do this pretty easily with plyr, like this:

df <- data.frame( a=rnorm(1e4), b=rnorm(1e4))
require(plyr)
system.time(myList <- alply( df, 1, function(x) data.frame(x) ))

Once I have my data as a list I can easily do things like:

mclapply( myList, function(x) doSomething(x$a) )

This works swimmingly, but I have quite a lot of data and the alply() step is quite slow. I tried using the multicore parallel backend on the alply() step, but it never used more than one processor even though I had registered 8. I suspect the parallel option may not work with this type of problem.

Any tips on how to make this faster? Maybe a base R solution?

JD Long

2 Answers

17

Just use split. It's a few times faster than your alply line.

> system.time(myList <- alply( df, 1, function(x) data.frame(x) ))
   user  system elapsed 
   7.53    0.00    7.57 
> system.time( splitList <- split(df, 1:NROW(df)) )
   user  system elapsed 
   1.73    0.00    1.74 
> 

I suspect the parallel backend on alply is only used for function evaluation (not for splitting and re-combining).
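
For completeness, here is a minimal sketch of the whole workflow with split feeding mclapply. It assumes the base `parallel` package as a stand-in for multicore, and `doSomething` is a hypothetical placeholder for whatever the question's function really does; the smaller data frame is only there to keep the example quick.

```r
library(parallel)  # assumption: base R's parallel package stands in for multicore

set.seed(1)
df <- data.frame(a = rnorm(1e3), b = rnorm(1e3))

# Split into a list of one-row data frames, one list element per row.
rowList <- split(df, seq_len(nrow(df)))

# Hypothetical stand-in for the question's doSomething().
doSomething <- function(a) a^2

# mc.cores = 1 runs on every platform; raise it on Unix-alikes to use more cores.
res <- mclapply(rowList, function(x) doSomething(x$a), mc.cores = 1L)
```

The splitting itself stays serial here; only the function calls inside mclapply are spread across cores, so a fast split matters for large inputs.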

UPDATE:
If you can convert your data.frame to a matrix, the solution below will be über-fast. You can also call split on the matrix, but it drops the column names and returns a plain vector in each list element.

> m <- as.matrix(df)
> system.time( matrixList <- lapply(1:NROW(m), function(i) m[i,,drop=FALSE]) )
   user  system elapsed 
   0.02    0.00    0.02
> str(matrixList[[1]])
 num [1, 1:2] -0.0956 -1.5887
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "a" "b"
> system.time( matrixSplitList <- split(m, 1:NROW(m)) )
   user  system elapsed 
   0.01    0.00    0.02 
> str(matrixSplitList[[1]])
 num [1:2] -0.0956 -1.5887
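
One consequence of the matrix version is that downstream code can no longer use `$a`. A rough sketch of how the two list shapes might be consumed, with `doSomething` again a hypothetical placeholder and a smaller data set assumed for speed:

```r
set.seed(1)
df <- data.frame(a = rnorm(1e3), b = rnorm(1e3))
m <- as.matrix(df)

# One 1 x 2 matrix per row; drop = FALSE keeps the column names.
matrixList <- lapply(seq_len(nrow(m)), function(i) m[i, , drop = FALSE])

# Hypothetical stand-in for the question's doSomething().
doSomething <- function(a) a^2

# Columns of a matrix row are addressed by name with [, "a"] rather than $a.
res <- lapply(matrixList, function(x) doSomething(x[, "a"]))

# split() on the matrix gives the same values as unnamed vectors,
# so columns have to be picked by position instead (a is element 1).
matrixSplitList <- split(m, seq_len(nrow(m)))
resSplit <- lapply(matrixSplitList, function(x) doSomething(x[1]))
```

Indexing by position is more fragile than indexing by name, which is the main reason to prefer the `drop = FALSE` version when the column order might change.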
Joshua Ulrich
  • I think your conclusion about the splitting and combining is correct. – JD Long Feb 24 '11 at 22:01
  • clearly my lesson for today is, "everything is faster on matrices" – JD Long Feb 24 '11 at 22:31
  • A fairer comparison is `system.time(myList <- alply( df, 1, identity ))` but it still takes too long :( – hadley Feb 25 '11 at 22:57
  • It should be possible to make it as fast as split, if only R had a fast function to subscript an object by an arbitrary number of dimensions. – hadley Feb 25 '11 at 23:23
6

How about this?

jdList <- split(df, 1:nrow(df))

> class(jdList[[1]])
[1] "data.frame"

> system.time(jdList <- split(df, 1:nrow(df)))
   user  system elapsed 
   1.67    0.02    1.70 
> system.time(myList <- alply( df, 1, function(x) data.frame(x) ))
   user  system elapsed 
    7.2     0.0     7.3 
Roman Luštrik