4

I have a fairly large dataset (by my standards) and I want to create a sequence number for blocks of records. I can use the plyr package, but the execution time is very slow. The code below replicates a comparable size dataframe.

## simulate an example of the size of a normal data frame
N <- 30000
id <- sample(1:17000, N, replace=T)
term <- as.character(sample(c(9:12), N, replace=T))
date <- sample(seq(as.Date("2012-08-01"), Sys.Date(), by="day"), N, replace=T)
char <- data.frame(matrix(sample(LETTERS, N*50, replace=T), N, 50))
val <- data.frame(matrix(rnorm(N*50), N, 50))
df <- data.frame(id, term, date, char, val, stringsAsFactors=F)
dim(df)

In reality, this is a little smaller than what I work with, as the values are typically larger...but this is close enough.

Here is the execution time on my machine:

> system.time(test.plyr <- ddply(df, 
+                                .(id, term), 
+                                summarise, 
+                                seqnum = 1:length(id), 
+                                .progress="text"))
  |===============================================================================================| 100%
   user  system elapsed 
  63.52    0.03   63.85 

Is there a "better" way to do this? Unfortunately, I am on a Windows machine.

Thanks in advance.

EDIT: Data.table is extremely fast, but I can't get my sequence numbers to calc correctly. Here is what my ddply version created. The majority only have one record in the group, but some have 2 rows, 3 rows, etc.

> with(test.plyr, table(seqnum))
seqnum
    1     2     3     4     5 
24272  4950   681    88     9 

And using data.table as shown below, the same approach yields:

> with(test.dt, table(V1))
V1
    1 
24272 
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
Btibert3
  • 38,798
  • 44
  • 129
  • 168

1 Answers1

5

Use data.table

dt = data.table(df)
test.dt = dt[,.N,"id,term"]

Here is a timing comparison. I used N = 3000 and replaced the 17000 with 1700 while generating the dataset

f_plyr <- function(){
  test.plyr <- ddply(df, .(id, term), summarise, seqnum = 1:length(id), 
 .progress="text")
}

f_dt <- function(){
 dt = data.table(df)
 test.dt = dt[,.N,"id,term"]
}

library(rbenchmark)
benchmark(f_plyr(), f_dt(), replications = 10,
  columns = c("test", "replications", "elapsed", "relative"))

data.table speeds up things by a factor of 170

test replications elapsed relative
2   f_dt()           10   0.779    1.000
1 f_plyr()           10 132.572  170.182

Also check out Hadley's latest work on dplyr. I wouldn't be surprised if dplyr provides an additional speedup, given that a lot of the code is being reworked in C.

UPDATE: Edited code, changing length(id) to .N as per Matt's comment.

Ramnath
  • 54,439
  • 16
  • 125
  • 152
  • thanks for this. I know Hadley is working on it, but in the interim, I need a fix. The improvement in speed is crazy. On my machine, when I run with(test.df, table(V1)), all I see is a sequence number of 1. I know we are simulating data, but some records should have more than one record within the grouping. – Btibert3 Nov 16 '12 at 14:56
  • The problem is data.table sorts the columns differently. If you use `head(plyr::arrange(test.dt, id))`, you will see that the output is the same as `test.plyr` – Ramnath Nov 16 '12 at 14:57
  • Thanks for the help. Most likely user error on my end, but no matter what I do, V1 only has a value of 1 when I use the data.table solution. See above. – Btibert3 Nov 16 '12 at 15:40
  • @Btibert3 Yes, I don't think Ramnath has tested this. Please replace `length(id)` with `.N`. There are some previous questions in this tag about `.N` on this point. – Matt Dowle Nov 16 '12 at 15:57
  • @Matt thanks. I did not realize that aggregation does not work the same way as plyr. Good to know the `.N` trick. It is not documented in the wiki or the vignette. Any suggestions as to where I could go to get a more comprehensive picture of these tricks? – Ramnath Nov 16 '12 at 16:00
  • Thanks for this, we are moving forward. When we use `.N` , this is counting the number of records within the group, not generating a sequence number for all records within the block. For example, `nrow(df) == nrow(test.dt)` and `nrow(df) == nrow(test.plyr)` – Btibert3 Nov 16 '12 at 16:12
  • @Ramnath `.N` is documented in `?data.table`. There is a related FAQ (which could be clearer): FAQ 2.10. And here's the [recent question](http://stackoverflow.com/questions/13196296/why-is-length-of-a-group-variable-always-1-in-data-table-grouping). – Matt Dowle Nov 16 '12 at 16:26
  • 1
    @Btibert3 Please replace (just) `length(id)` with `.N`, not `1:length(id)` with `.N`. So, `1:.N`. Or, `seq_len(.N)`. – Matt Dowle Nov 16 '12 at 16:27
  • @MatthewDowle That did the trick....I am blown away by how fast that is. Thanks! – Btibert3 Nov 16 '12 at 16:44
  • @Btibert3 Excellent. Btw, with the recent updates (v1.8.6 is now on CRAN) it might be quickest to get up to date by reading NEWS thoroughly, especially from 1.8.4. Things like not-joins are done now, and `.I` for example. (For Ramnath as well.) – Matt Dowle Nov 16 '12 at 16:54