3

I'm asking this as a general/beginner question about R, not specific to the package I was using.

I have a dataframe with 3 million rows and 15 columns. I don't consider this a huge dataframe, but maybe I'm wrong.

I ran the following script and it has now been running for 2+ hours. I imagine there must be something I can do to speed this up.

Code:

ddply(orders, .(ClientID), NumOrders=len(OrderID))

This is not an overly intensive script, or at least I don't think it is.

In a database, you could add an index to a table to increase join speed. Is there a similar action in R I should be doing on import to make functions/packages run faster?

Paul Hiemstra
mikebmassey
  • See the [data.table](http://cran.r-project.org/web/packages/data.table/) package. – Joshua Ulrich Jun 06 '12 at 00:28
  • @JoshuaUlrich data.table instead of dataframe? Are they truly interchangeable? Thanks – mikebmassey Jun 06 '12 at 00:32
  • Came in to suggest `data.table` too. This op will be significantly faster and you can run the same bit of code once you convert your `data.frame` to `data.table`. `orders <- data.table(orders)`. That simple. – Maiasaura Jun 06 '12 at 00:32
  • Just to explain, **plyr** is extremely popular due to its oh-so-sweet syntactic sugar, but it is slow for large data sets, particularly when the number of groups in your splitting variable is large. Spend some time learning data.table; the syntax isn't as nice (IMHO) but it will often be many orders of magnitude faster. – joran Jun 06 '12 at 00:36
  • As a side note, the plyr package is a terrific tool (easy syntax) but really not the best tool for larger sets of data, and I would consider 3 million obs. a larger set, but I'm in education and get excited about data sets of 100. – Tyler Rinker Jun 06 '12 at 00:36
  • It's chewing on the `data.table` right now. I'll report back on the performance. @joran if I am looking at aggregation and grouping, is there a better tool to use than `plyr`? Thanks all – mikebmassey Jun 06 '12 at 00:48
  • It sounds like `table(orders$ClientID)` might get you what you want too. – Dason Jun 06 '12 at 00:56
  • There are lots of questions on aggregation techniques. I put together a (relatively) exhaustive list in this question with timings for popular answers: http://stackoverflow.com/questions/10748253/idiomatic-r-code-for-partitioning-a-vector-by-an-index-and-performing-an-operati/10748470#10748470 – Chase Jun 06 '12 at 01:12

3 Answers

3

Looks to me like you might want:

orders$NumOrders <- with(orders, ave(OrderID, ClientID, FUN = length))

(I'm not aware that a len() function exists.)
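
A quick usage sketch on a toy data frame (column names taken from the question), just to show the shape of the result: ave() returns a vector as long as its input, so every row gets its group's count.

# Toy data; ClientID and OrderID are the column names assumed from the question
orders <- data.frame(ClientID = c(1, 1, 2, 3, 3, 3),
                     OrderID  = 101:106)

# length() is applied within each ClientID group and recycled to every row of that group
orders$NumOrders <- with(orders, ave(OrderID, ClientID, FUN = length))
orders$NumOrders
# [1] 2 2 1 3 3 3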

IRTFM
2

With the suggested data.table package, the following operation should do the job within a second:

orders[, list(NumOrders = length(OrderID)), by = ClientID]
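
A minimal sketch of the full workflow, assuming `orders` starts out as the data.frame from the question (object and column names are taken from the question):

library(data.table)

orders <- data.table(orders)   # convert the data.frame to a data.table
setkey(orders, ClientID)       # optional: keying by ClientID is data.table's rough analogue of a database index

# Count orders per client; .N is data.table's built-in row count per group,
# so it is equivalent to length(OrderID) here
counts <- orders[, list(NumOrders = .N), by = ClientID]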
Dirk
1

It seems like all your code is doing is this:

orders[order(orders$ClientID), ]

That would be faster.
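
For reference, a tiny sketch (toy data, column name from the question) of what that line does: order() returns the row indices that sort ClientID, and indexing with them reorders the rows.

orders <- data.frame(ClientID = c(3, 1, 2), OrderID = c(301, 101, 201))
orders[order(orders$ClientID), ]
#   ClientID OrderID
# 2        1     101
# 3        2     201
# 1        3     301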

mdsumner