I'm observing some weird behavior with data.table in R and I'm wondering whether it is a bug. Here is the code I use:
num_req <- fread("number_requests.csv")
num_req[, nrequests := sum(nrequests, na.rm=T), by = list(reqtype, server, timestamp)]
The part of the output I find weird is the following:
nrequests timestamp reqtype server
3: 22 1489276860 reqtype1 server1
4: 22 1489276860 reqtype1 server1
As you can see, I end up with a duplicate row even though all keys are exactly the same.
If I re-run the aggregation on the output, I get
nrequests timestamp reqtype server
3: 44 1489276860 reqtype1 server1
4: 44 1489276860 reqtype1 server1
Now, if I do
tmp2 <- num_req[, list(nrequests = sum(nrequests, na.rm=T)), by = list(reqtype, server, timestamp)]
then I do not get any duplicated rows.
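For reference, here is a minimal sketch that contrasts the two calls on toy data. The table `toy` and its values (10 and 12) are made up for illustration and are not taken from my CSV; only the column names and the two expressions match my real code. As far as I can tell, it shows the same pattern: `:=` keeps every original row and writes the group sum into each of them, while `list()` returns one row per group.

```r
library(data.table)

# Toy data (made-up values, not from number_requests.csv):
# two rows sharing the same reqtype, server and timestamp.
toy <- data.table(
  nrequests = c(10L, 12L),
  timestamp = c(1489276860L, 1489276860L),
  reqtype   = c("reqtype1", "reqtype1"),
  server    = c("server1", "server1")
)

# Assignment by reference: overwrites nrequests with the group sum
# in every row of the group, so the row count does not change.
by_ref <- copy(toy)
by_ref[, nrequests := sum(nrequests, na.rm = TRUE),
       by = list(reqtype, server, timestamp)]

# Aggregation with list(): returns a new table with one row per group.
agg <- toy[, list(nrequests = sum(nrequests, na.rm = TRUE)),
           by = list(reqtype, server, timestamp)]

print(by_ref)  # still two rows, both carrying the group sum
print(agg)     # a single row for the group
```

Running the `:=` line a second time on `by_ref` sums the already-summed values again, which would match the doubling from 22 to 44 that I see above.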
Some info:
- The dataset has 75015360 rows, but even working with a subset (mysubset <- num_req[timestamp==1489276860]) shows the same duplication if I use the assignment by reference
- I'm using data.table version 1.10.4
- I'm using R version 3.3
- I'm working on RStudio Server version 1.0.44
- the host virtual machine is Red Hat Enterprise Linux Server release 6.8 (Santiago)
The question is: am I misusing the := operator, or is this a bug?
Thanks for your help!