I have a fairly large dataset (by my standards) and I want to create a sequence number for blocks of records. I can use the plyr package, but the execution time is very slow. The code below replicates a comparable size dataframe.
## simulate an example of the size of a normal data frame
N <- 30000
id <- sample(1:17000, N, replace=T)
term <- as.character(sample(c(9:12), N, replace=T))
date <- sample(seq(as.Date("2012-08-01"), Sys.Date(), by="day"), N, replace=T)
char <- data.frame(matrix(sample(LETTERS, N*50, replace=T), N, 50))
val <- data.frame(matrix(rnorm(N*50), N, 50))
df <- data.frame(id, term, date, char, val, stringsAsFactors=F)
dim(df)
In reality, this is a little smaller than what I work with, as the values are typically larger...but this is close enough.
Here is the execution time on my machine:
> system.time(test.plyr <- ddply(df,
+ .(id, term),
+ summarise,
+ seqnum = 1:length(id),
+ .progress="text"))
|===============================================================================================| 100%
user system elapsed
63.52 0.03 63.85
Is there a "better" way to do this? Unfortunately, I am on a Windows machine.
Thanks in advance.
EDIT: Data.table is extremely fast, but I can't get my sequence numbers to calc correctly. Here is what my ddply version created. The majority only have one record in the group, but some have 2 rows, 3 rows, etc.
> with(test.plyr, table(seqnum))
seqnum
1 2 3 4 5
24272 4950 681 88 9
And using data.table as shown below, the same approach yields:
> with(test.dt, table(V1))
V1
1
24272