How to get length of current group in data.table grouping?

Question

I know this can be achieved with other packages, but I'm trying to do it in data.table (as it seems to be the fastest for grouping).

library(data.table)
dt = data.table(a=c(1,2,2,3))
dt[,length(a),by=a]

results in

whereas

library(plyr)  # ddply() and .() come from plyr
df = data.frame(a=c(1,2,2,3))
ddply(df,.(a),summarise,V1=length(a))

produces

which is a more sensible result. I'm wondering why data.table does not give the same result, and how this can be achieved.

Josh O'Brien · Accepted Answer · 2012-11-02T14:12:05.733

The data.table way to do this is to use the special variable .N, which keeps track of the number of rows in the current group. (Other special variables include .SD and .BY, available in version 1.8.2, and .I and .GRP, available from version 1.8.3. All are documented in ?data.table.)

library(data.table)
dt = data.table(a=c(1,2,2,3))

dt[, .N, by = a]
#    a N
# 1: 1 1
# 2: 2 2
# 3: 3 1

To see why what you tried didn't work, run the following, checking the value of a and length(a) at each browser prompt:

dt[, browser(), by = a]
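
As a non-interactive alternative to stepping through browser(), the following minimal sketch puts length(a) and .N side by side for each group. The column names len_a and n are made up for illustration, and the .() alias for list() assumes a reasonably recent data.table version:

```r
library(data.table)
dt <- data.table(a = c(1, 2, 2, 3))

# Inside j, the grouping column a is a length-1 vector holding the group's
# value, so length(a) is 1 in every group; .N is the group's row count.
res <- dt[, .(len_a = length(a), n = .N), by = a]
print(res)
# len_a is 1 for all three groups; n is 1, 2, 1
```

This should make it visible that length(a) measures the single group value rather than the group's rows, which is why .N is needed.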

edited Nov 02 '12 at 14:12

answered Nov 02 '12 at 13:48


+1 @jamborta Also see [FAQ 2.10](http://datatable.r-forge.r-project.org/datatable-faq.pdf) for some background. The reason is efficiency: it avoids repeating the same group value through a potentially long vector (saving time and space). In operations with longer vectors, R will recycle length-1 vectors anyway, if and when needed. So `.N` is the way to go here. – Matt Dowle Nov 02 '12 at 15:10
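
The recycling mentioned in the comment above can be sketched as follows. The extra column b and the result name scaled are hypothetical, added only to have a longer vector to combine with the length-1 group value:

```r
library(data.table)
# b is a made-up second column, used only for illustration
dt <- data.table(a = c(1, 2, 2, 3), b = c(10, 20, 30, 40))

# Within each group, a has length 1 and b has .N elements;
# R recycles a across b, so the group value never needs to be repeated.
res <- dt[, .(scaled = b * a), by = a]
print(res)
# scaled is 10, 40, 60, 120
```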
