I have a datatable, DT, with columns A, B and C. I want only one A per unique B, and I want to choose that A based on the value of C (choose the largest C).
Based on this (incredibly helpful) SO page, Use data.table
to get first of subgroup based on a variable, I tried something like this:
test <- data.table(A=c(1:3,1:2),B=c(1:5),C=c(11:15))
setkey(test,A,C)
test[,.SD[.N],by="A"]
In my test case, this gives me an answer that seems right:
# A B C
# 1: 1 6 16
# 2: 2 7 17
# 3: 3 8 18
# 4: 4 4 14
# 5: 5 5 15
And, as expected, the number of rows matches the number of unique entries for "A" in my DT:
length(unique(test$A))
# 5
However, when I apply this to my actual dataset, I am missing approximately 20% of my initially ~2 million rows.
I cannot seem to put together a test dataset that will recreate this type of a loss. There are no null values in the actual dataset. What else could be a factor in a dataset that would cause a discrepancy between the number of results from something like test[,.SD[.N],by="A"]
and length(unique(test$A))
?