
As an extension of this question, I'd like to run calculations in which the `.BY` variable is itself the product of a calculation. The questions I've reviewed group by a key that merely accesses, but does not transform or aggregate, an existing value.

In this example, I'm trying to produce an ROC curve for a binary classifier with a function that exploits data.table (because ROC calculations in existing packages are quite slow). Here, the `.BY` variable is the cutpoint, and the calculations are the true positive and false positive rates for a probability estimate at that cutpoint.

I am able to do this with an intermediate data.table, but I am looking for a more efficient solution. This works:

# dummy example: German credit data, with binary target y = 1 for 'Bad' credit
library(data.table)
dt <- setDT(get(data(GermanCredit, package='caret'))  # data() returns the dataset's name
            )[, `:=`(y = as.integer(Class=='Bad'),
                     Class = NULL)]
model <- glm(y ~ ., family='binomial', data=dt)   # logistic regression on all predictors
dt[,y_est := predict(model, type='response')]     # in-sample probability estimates
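
# For orientation: at a single cutpoint, TPR and FPR are just conditional
# counts (an illustrative sketch; cutpt = 0.5 is an arbitrary threshold)
cutpt <- 0.5
dt[, .(tpr = sum(y_est >= cutpt & y == 1) / sum(y == 1),
       fpr = sum(y_est >= cutpt & y == 0) / sum(y == 0))]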

#--- Generate ROC with specified # of cutpoints  ---
# level of resolution of ROC curve -- up to uniqueN(y_est)
res <- 5 

# vector of cutpoints (thresholds for y_est)
cuts <- dt[,.( thresh=quantile(y_est, probs=0:res/res) )]

# at y_est >= each threshold, how many true positive and false positives?
roc <-  cuts[, .( tpr = dt[y_est>=.BY[[1]],sum(y==1)]/dt[,sum(y==1)],
                  fpr = dt[y_est>=.BY[[1]],sum(y==0)]/dt[,sum(y==0)]
                 ), by=thresh]

plot(tpr~fpr,data=roc,type='s') # looks right
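
# sanity check against an existing ROC implementation (assumes the pROC
# package is installed; it isn't used elsewhere in this question) -- pROC
# plots sensitivity vs. specificity on a reversed x-axis, so the curve
# should trace the same shape
library(pROC)
plot(roc(dt$y, dt$y_est))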


But this doesn't work:

# this doesn't work, and doesn't have access to the total positives & negatives
dt[, .(tp=sum( (y_est>=.BY[[1]]) & (y==1)  ),
       fp=sum( (y_est>=.BY[[1]]) & (y==0)  ) ),
   keyby=.(thresh= quantile(y_est, probs=0:res/res) )]
# Error in `[.data.table`(dt, , .(tp = sum((y_est >= .BY[[1]]) & (y == 1)),  : 
#   The items in the 'by' or 'keyby' list are length (6).
#   Each must be same length as rows in x or number of rows returned by i (1000).

The error makes sense in hindsight: each item in `by` is evaluated within `dt` and must be the same length as the rows of `x` (or the rows returned by `i`), so the six quantiles can't define groups directly. Is there an idiomatic data.table (or at least more efficient) way to do this?

– C8H10N4O2

1 Answer


You could use non-equi joins:

dt[.(thresh = quantile(y_est, probs=0:res/res)), on = .(y_est >= thresh),
   .(fp = sum(y == 0), tp = sum(y == 1)), by = .EACHI][,
   lapply(.SD, function(x) x/x[1]), .SDcols = -"y_est"]
#           fp          tp
#1: 1.00000000 1.000000000
#2: 0.72714286 0.970000000
#3: 0.46857143 0.906666667
#4: 0.24142857 0.770000000
#5: 0.08142857 0.476666667
#6: 0.00000000 0.003333333
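
To unpack the non-equi join: the `.()` in `i` is shorthand for `data.table()`, so the first line is equivalent to building the lookup table explicitly first (a sketch; `cuts` is just an illustrative name):

cuts <- data.table(thresh = quantile(dt$y_est, probs = 0:res/res))
dt[cuts, on = .(y_est >= thresh),          # rows of dt with y_est >= each thresh
   .(fp = sum(y == 0), tp = sum(y == 1)),  # raw counts per threshold
   by = .EACHI]                            # one group per row of cuts

The lowest threshold (probs = 0) keeps every row, so the first row holds the total negatives and positives; dividing each column by its first element (the `x/x[1]` step) converts the counts to rates.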
– eddi
  • This is magically wonderful, thank you. I understand lines 2 & 3 -- that `x/x[1]` trick is clever. But line 1: I am trying to grok the non-equi join, which is apparently a new-ish feature (v1.9.8). Can you help me understand what is going on with `i` in line 1? There is no `Y` as in `X[Y]` because it's a self-join? – C8H10N4O2 Jun 26 '17 at 20:56
  • The dot there is equivalent to typing `data.table`, so it's as if I wrote `dt[cuts,...` – eddi Jun 26 '17 at 21:05