3

I'm asking this as a general/beginner question about R, not specific to the package I was using.

I have a dataframe with 3 million rows and 15 columns. I don't consider this a huge dataframe, but maybe I'm wrong.

I ran the following script and it has now been running for 2+ hours. I imagine there must be something I can do to speed this up.

Code:

ddply(orders, .(ClientID), NumOrders=len(OrderID))

This is not an overly intensive script, or at least I don't think it is.

In a database, you could add an index to a table to increase join speed. Is there a similar action in R I should be doing on import to make functions/packages run faster?

Paul Hiemstra
mikebmassey
  • See the [data.table](http://cran.r-project.org/web/packages/data.table/) package. – Joshua Ulrich Jun 06 '12 at 00:28
  • @JoshuaUlrich data.table instead of dataframe? Are they truly interchangeable? Thanks – mikebmassey Jun 06 '12 at 00:32
  • Came in to suggest `data.table` too. This op will be significantly faster and you can run the same bit of code once you convert your `data.frame` to `data.table`. `orders <- data.table(orders)`. That simple. – Maiasaura Jun 06 '12 at 00:32
  • Just to explain, **plyr** is extremely popular due to its oh-so-sweet syntactic sugar, but it is slow for large data sets, particularly when the number of groups in your splitting variable is large. Spend some time learning data.table; the syntax isn't as nice (IMHO) but it will often be many orders of magnitude faster. – joran Jun 06 '12 at 00:36
  • As a side note, the plyr package is a terrific tool (easy syntax) but really not the best tool for larger sets of data, and I would consider 3 million obs. a larger set, but I'm in education and get excited about data sets of 100. – Tyler Rinker Jun 06 '12 at 00:36
  • It's chewing on the `data.table` right now. I'll report back on the performance. @joran if I am looking at aggregation and grouping, is there a better tool to use than `plyr`? Thanks all – mikebmassey Jun 06 '12 at 00:48
  • It sounds like `table(orders$ClientID)` might get you what you want too. – Dason Jun 06 '12 at 00:56
  • There are lots of questions on aggregation techniques. I put together a (relatively) exhaustive list in this question with timings for popular answers: http://stackoverflow.com/questions/10748253/idiomatic-r-code-for-partitioning-a-vector-by-an-index-and-performing-an-operati/10748470#10748470 – Chase Jun 06 '12 at 01:12

3 Answers

3

Looks to me like you might want:

orders$NumOrders <- with(orders, ave(OrderID, ClientID, FUN = length))

(I'm not aware that a len() function exists.)
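
A quick usage sketch on a toy data frame (column names taken from the question), just to show the shape of the result: ave() returns a vector as long as its input, so every row gets its group's count.

# Toy data; ClientID and OrderID are the column names assumed from the question
orders <- data.frame(ClientID = c(1, 1, 2, 3, 3, 3),
                     OrderID  = 101:106)

# length() is applied within each ClientID group and recycled to every row of that group
orders$NumOrders <- with(orders, ave(OrderID, ClientID, FUN = length))
orders$NumOrders
# [1] 2 2 1 3 3 3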

IRTFM
2

With the suggested data.table package, the following operation should do the job within a second:

orders[, list(NumOrders = length(OrderID)), by = ClientID]
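
A minimal sketch of the full workflow, assuming `orders` starts out as the data.frame from the question (object and column names are taken from the question):

library(data.table)

orders <- data.table(orders)   # convert the data.frame to a data.table
setkey(orders, ClientID)       # optional: keying by ClientID is data.table's rough analogue of a database index

# Count orders per client; .N is data.table's built-in row count per group,
# so it is equivalent to length(OrderID) here
counts <- orders[, list(NumOrders = .N), by = ClientID]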
Dirk
1

It seems like all your code is doing is this:

orders[order(orders$ClientID), ]

That would be faster.
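
For reference, a tiny sketch (toy data, column name from the question) of what that line does: order() returns the row indices that sort ClientID, and indexing with them reorders the rows.

orders <- data.frame(ClientID = c(3, 1, 2), OrderID = c(301, 101, 201))
orders[order(orders$ClientID), ]
#   ClientID OrderID
# 2        1     101
# 3        2     201
# 1        3     301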

mdsumner