
I have a data.table, DT, with columns A, B, and C. I want only one row per unique value of A, and I want to choose that row based on the value of C (keep the row with the largest C).

Based on this (incredibly helpful) SO page, "Use data.table to get first of subgroup based on a variable", I tried something like this:

library(data.table)
test <- data.table(A=c(1:5,1:3),B=c(1:8),C=c(11:18))
setkey(test,A,C)   # sorted by A then C, so .SD[.N] picks the max-C row per A
test[,.SD[.N],by="A"]

In my test case, this gives me an answer that seems right:

#     A  B   C
# 1:  1  6  16
# 2:  2  7  17
# 3:  3  8  18
# 4:  4  4  14
# 5:  5  5  15

And, as expected, the number of rows matches the number of unique entries for "A" in the test table:

length(unique(test$A))
# 5
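
As an aside, here is an equivalent way to express the same "largest C per A" selection without setting a key - just a sketch of the same idea, not a fix for the problem below:

# pick the row with the largest C within each A, no setkey() needed
test[, .SD[which.max(C)], by = A]

# or via row indices, a common data.table idiom
test[test[, .I[which.max(C)], by = A]$V1]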

However, when I apply this to my actual dataset, I am missing approximately 20% of my initial ~2 million rows.

I cannot seem to put together a test dataset that recreates this kind of loss. There are no null values in the actual dataset. What else could be a factor in a dataset that would cause a discrepancy between the number of rows returned by something like test[,.SD[.N],by="A"] and length(unique(test$A))?

  • "something like" or those exact commands, just on a different data set? – eddi Jan 23 '16 at 00:30
  • Also, it's unclear to me why you don't simply reduce the dataset until you find the smallest possible one that still exhibits whatever undesired behavior. Simply remove, say, the last N rows - do you still get the same issue? No, then the problem is in the last N rows. Yes, then remove more rows. Binary search over 2M rows will find the problem in at most 21 steps. (A sketch of this appears after these comments.) – eddi Jan 23 '16 at 00:37
  • @eddi: The exact commands, just with the appropriate column names changed. And, good point. Will do now.... – neuropsych Jan 23 '16 at 00:43
  • @eddi: Ah! Thank you!! The answer is: when I imported the table from its original SQL format, some of the columns were converted to scientific notation. I thought this was merely a display annoyance (making it hard to see when two values differed only in their final digit), but it turns out (to my surprise) the values were essentially being rounded in the by="A" subsetting, while still being treated as unique by length(unique()). Thanks for the debugging coaching - much appreciated! – neuropsych Jan 23 '16 at 00:56
  • great, glad it worked out! comparisons of floating point numbers is always a tricky issue – eddi Jan 23 '16 at 00:59
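
A minimal sketch of eddi's bisection suggestion, assuming the discrepancy survives in a prefix of the table (DT stands in for the real dataset; shows_issue is a hypothetical helper):

# TRUE when grouping loses rows relative to the unique count
shows_issue <- function(dt) {
  nrow(dt[,.SD[.N],by="A"]) < length(unique(dt$A))
}

# halve the prefix while it still reproduces the discrepancy
n <- nrow(DT)
while (n > 1L && shows_issue(DT[1:(n %/% 2L)])) {
  n <- n %/% 2L
}
culprit <- DT[1:n]   # a small prefix that still shows the issue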

1 Answer


Thanks to @eddi's debugging coaching, here's the answer, at least for my dataset: differential handling of numbers in scientific notation.

In particular: in my actual dataset, columns A and B held very long numbers that, on import from SQL into R, had been stored in scientific notation. It turns out the test[,.SD[.N],by="A"] and length(unique(test$A)) commands were handling those values differently: length(unique(test$A)) preserved the difference between two values that differed only in a digit too small to show up in the printed scientific notation, whereas test[,.SD[.N],by="A"] was, in essence, rounding the values when grouping and thus collapsing some of them together.
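
A minimal sketch of how this can happen, using hypothetical values; whether the groups actually collapse depends on data.table's numeric rounding setting (see ?setNumericRounding), whose default has changed across versions:

library(data.table)

# two IDs that differ only in the 16th significant digit;
# the default scientific-notation printing hides the difference
x  <- c(1234567890123456, 1234567890123457)
dt <- data.table(A = x, C = 1:2)

length(unique(dt$A))    # 2 - unique() compares at full precision

setNumericRounding(2)   # the old default: round off the last 2 bytes
dt[,.SD[.N],by="A"]     # 1 row on versions where grouping rounds doubles

setNumericRounding(0)   # group on full precision
dt[,.SD[.N],by="A"]     # 2 rows - the values stay distinct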

(I feel foolish that I didn't catch this myself before posting, but I much appreciate the help - I hope this spares someone else the same confusion!)
