Why such large differences in performance between R's by() and lapply()?

Question

I have an xts object containing time series for multiple stock symbols. I need to split the xts object into symbol-specific subgroups, process the data for each symbol, and then reassemble the subgroups into the original xts matrix containing the full set of rows. Each symbol is a field of 1 to 4 characters that is used as the factor index to split the matrix into subgroups.
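For reference, the split / process / reassemble workflow described above can be sketched in base R with `split()`, `lapply()` and `unsplit()`. This is only an illustration on a tiny plain character matrix, not on the real xts object; the matrix `m`, its values, and the de-meaning step are hypothetical stand-ins for the actual data and processing.

```r
# Toy stand-in for the real data: a character matrix with a Symbol column.
# Column names and values are illustrative only.
m <- cbind(
  Symbol = c("AA", "AAPL", "AA", "ABC", "AAPL", "ABC"),
  Close  = c("158200", "3894400", "158300", "401600", "3894000", "401800")
)

f     <- factor(m[, "Symbol"])      # grouping factor, one entry per row
close <- as.numeric(m[, "Close"])   # work on a numeric vector, not on strings

# Split by symbol, process each group (here: subtract the group mean),
# then put the results back in the original row order.
groups   <- split(close, f)
demeaned <- unsplit(lapply(groups, function(v) v - mean(v)), f)

# demeaned[i] corresponds to row i of m.
result <- data.frame(Symbol = m[, "Symbol"], Close = close, Demeaned = demeaned)
print(result)
```

`unsplit()` reverses `split()` for the same factor, so the processed values line up with the original rows without any manual reordering.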

These are the times reported for splitting my matrix with by(), ddply() and lapply():

> dim(ets)
[1] 442750     24
> head(ets)
                    Symbol DaySec  ExchTm             LclTm              Open      High      Low       Close     CloseRet       
2011-07-22 09:35:00 "AA"   "34500" "09:34:54.697.094" "09:34:54.697.052" " 158100" " 158400" " 157900" " 158200" " 6.325111e-04"
2011-07-22 09:35:00 "AAPL" "34500" "09:34:59.681.827" "09:34:59.681.797" "3899200" "3899200" "3892200" "3894400" "-1.231022e-03"
2011-07-22 09:35:00 "ABC"  "34500" "09:34:49.805.994" "09:34:49.806.008" " 400100" " 401800" " 400100" " 401600" " 3.749063e-03"
2011-07-22 09:35:00 "ALL"  "34500" "09:34:59.009.001" "09:34:59.008.810" " 285500" " 285500" " 285300" " 285300" "-7.005254e-04"
2011-07-22 09:35:00 "AMAT" "34500" "09:34:59.982.447" "09:34:59.982.423" " 130200" " 130500" " 130200" " 130500" " 2.304147e-03"
2011-07-22 09:35:00 "AMZN" "34500" "09:34:48.012.576" "09:34:48.012.565" "2137400" "2139100" "2137400" "2139100" " 7.953588e-04"
... (15 more columns)
> system.time(by(ets, ets$Symbol, function(x) { return(x) }))
   user  system elapsed 
 78.725   0.932  79.735
> system.time(ddply(as.data.frame(ets), "Symbol", function(x) { return (x) }))
   user  system elapsed 
100.590   0.416 101.105 
> system.time(lapply(split.default(ets, ets$Symbol), function(x) { return(x) }))
   user  system elapsed 
  1.572   0.280   1.853

More information on working with data frame and matrix subgroups is available in this excellent blog post.

Why is there such a large difference in performance when using lapply/split.default?
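One likely contributor, offered as a sketch rather than a confirmed diagnosis: the documentation for `by()` states that its input is coerced to a data frame by default, so each group is extracted by subsetting data frame rows. Grouping the row numbers once and indexing the matrix directly skips that conversion. The comparison below uses synthetic data (the size, symbols and column names are assumptions) and a plain character matrix rather than an xts object, so the timings will not match the ones reported above.

```r
set.seed(42)
n   <- 20000
sym <- sample(c("AA", "AAPL", "ABC", "ALL", "AMAT", "AMZN"), n, replace = TRUE)

# Synthetic character matrix, loosely shaped like the one in the question.
m <- cbind(Symbol = sym,
           Open   = as.character(sample(100000:400000, n, replace = TRUE)),
           Close  = as.character(sample(100000:400000, n, replace = TRUE)))

# by(): the matrix is coerced to a data frame, then rows are subset per group.
t_by <- system.time(
  g_by <- by(m, m[, "Symbol"], function(x) x)
)

# Index split: group the row numbers once, then subset the matrix directly.
t_idx <- system.time({
  idx   <- split(seq_len(nrow(m)), m[, "Symbol"])
  g_idx <- lapply(idx, function(i) m[i, , drop = FALSE])
})

# Compare user/system/elapsed times for the two approaches.
print(rbind(by = t_by, index_split = t_idx)[, 1:3])
```

Both approaches produce one piece per symbol; the index-split pieces stay matrices, whereas `by()` hands each group to the function as a data frame.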

Please provide a small sample of `myxts`. I don't use `by` and there's probably a way to do what you want with the standard xts tools, but it's hard to know without a reproducible example. — Joshua Ulrich, Jan 17 '12 at 20:21
Yes, `head(myxts)` would be too big. How about `myxts[X:Y,1:5]` where rows `X:Y` contain more than one value for `Symbol`... and how does `summary` tell you anything useful for characters? — Joshua Ulrich, Jan 17 '12 at 21:07
I don't think I understand what you're trying to do, but try calling something like `lapply(split.default(myxts, myxts$Symbol), str)`. — Joshua Ulrich, Jan 17 '12 at 21:27
@Robert If you need fast grouping perhaps try [data.table](http://datatable.r-forge.r-project.org/) — Matt Dowle, Jan 18 '12 at 09:50

score 0 · Answer 1 · answered Jan 17 '12 at 21:11

Working in numeric mode greatly reduces the processing time:

> system.time(by(myxts[,c(1,2,3,4,5)], myxts$Symbol, summary))
   user  system elapsed 
 57.768   0.688  58.511 
> system.time(by(myxts[,c(1,2,3,4,5,6,7,8)], myxts$Symbol, summary))
   user  system elapsed 
  62.284   0.620  62.971 
> system.time(by(myxts[,c(1,2,3,4,5,6,7,8, 9, 10, 11, 12)], myxts$Symbol, summary))
    user  system elapsed 
 76.529   0.632  77.232 
> myxts.numeric = myxts
> mode(myxts.numeric) = "numeric"
Warning message:
In as.double.xts(c("AA", "AAPL", "ABC", "ALL", "AMAT", "AMZN", "BAC",  :
  NAs introduced by coercion
> system.time(by(myxts.numeric[,c(1,2,3,4,5,6,7,8, 9, 10, 11, 12)], myxts$Symbol, summary))
   user  system elapsed 
  4.948   0.688   5.642
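The coercion warning in the transcript is what you would expect when the `Symbol` column is converted along with everything else: strings such as "AA" cannot be parsed as numbers and become `NA`. A minimal sketch of the same idea on a plain matrix (not xts), saving the grouping key first and converting only the numeric-looking columns; the data and column names here are made up for illustration.

```r
# Hypothetical character matrix; the padded strings mimic the question's data.
m <- cbind(
  Symbol = c("AA", "AAPL", "AA", "AAPL"),
  Open   = c(" 158100", "3899200", " 158000", "3899000"),
  Close  = c(" 158200", "3894400", " 158300", "3894000")
)

sym <- factor(m[, "Symbol"])   # keep the grouping key before any coercion

# Convert only the price columns; mode<- keeps dim and dimnames.
num <- m[, c("Open", "Close"), drop = FALSE]
mode(num) <- "numeric"

# Per-symbol column means computed on the numeric matrix.
means <- t(sapply(split(seq_len(nrow(num)), sym),
                  function(i) colMeans(num[i, , drop = FALSE])))
print(means)
```

Because `Symbol` is excluded before the conversion, no `NA`s are introduced, and the grouping still uses the original symbols, just as the `by()` call above groups on `myxts$Symbol` rather than on the coerced object.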
