Assignment by reference and aggregation creates duplicates in data.table (R)

Question

I'm observing a weird behavior with data.table in R and I'm wondering whether this is a bug. Here is the code I use:

num_req <- fread("number_requests.csv")
num_req[, nrequests := sum(nrequests, na.rm = TRUE), by = list(reqtype, server, timestamp)]

The part of the output I find weird is the following:

   nrequests timestamp    reqtype   server   
3:        22 1489276860   reqtype1  server1
4:        22 1489276860   reqtype1  server1

As you can see, I end up with a duplicate row even though all keys are exactly the same.

If I re-run the aggregation on the output, I get

   nrequests timestamp    reqtype   server   
3:        44 1489276860   reqtype1  server1
4:        44 1489276860   reqtype1  server1

Now, if I do

tmp2 <- num_req[, list(nrequests = sum(nrequests, na.rm = TRUE)), by = list(reqtype, server, timestamp)]

then I do not get any duplicated rows.

Some info:

The dataset has 75015360 rows but even working with a subset (mysubset <- num_req[timestamp==1489276860]) shows the same duplication if I use the assignment by reference
I'm using data.table version 1.10.4
I'm using R version 3.3
I'm working on RStudio Server version 1.0.44
the host virtual machine is Red Hat Enterprise Linux Server release 6.8 (Santiago)

The question is: am I misusing the := operator, or is this a bug?

Thanks for your help!

You're kind of misusing the `:=` operator, which does exactly what it's supposed to do (adding/updating a column by reference) — talat, May 18 '17 at 10:39
Thanks for the quick reply docendo... if I understand you correctly, if the output data.table is smaller than the input data.table (because of the group by), I should not be using the `:=` operator? — , May 18 '17 at 15:29
`:=` updates every row of `num_req` _in place_ by overwriting `nrequests` with the respective group result. Without `:=`, a new object is created with one row per group. Compare `DT <- data.table(x1 = rep(LETTERS[1:2]), x2 = 1:4); DT[, .(x2 = sum(x2)), x1]; DT[, x2 := sum(x2), x1][]`. There is no good or bad in this case. It depends on your requirements. — Uwe, May 18 '17 at 17:06
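
The difference described in the comments can be sketched on a small made-up table. The values below are hypothetical stand-ins for number_requests.csv (which is not shown in the question), chosen so that one group has two rows summing to 22. The use of `copy()` and `unique()` is only one possible way to keep the original intact and to collapse the repeated rows afterwards, not necessarily what the asker should do.

```r
library(data.table)

# Hypothetical toy data standing in for number_requests.csv
num_req <- data.table(
  nrequests = c(10L, 12L, 5L),
  timestamp = c(1489276860L, 1489276860L, 1489276920L),
  reqtype   = c("reqtype1", "reqtype1", "reqtype1"),
  server    = c("server1", "server1", "server1")
)

# Aggregation without `:=`: a new table with one row per group
agg <- num_req[, .(nrequests = sum(nrequests, na.rm = TRUE)),
               by = .(reqtype, server, timestamp)]

# Update by reference: every row is kept and receives its group total,
# so rows belonging to the same group end up identical
by_ref <- copy(num_req)  # copy() so the original table is not modified
by_ref[, nrequests := sum(nrequests, na.rm = TRUE),
       by = .(reqtype, server, timestamp)]

# One way to collapse the identical rows after the update
dedup <- unique(by_ref)
```

Here `agg` and `dedup` each have two rows, while `by_ref` still has three. Running the `:=` line a second time on `by_ref` would sum the already-summed values within each group, which matches the jump from 22 to 44 reported in the question.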

0 Answers