I'm trying to calculate AUC but have run into a weird problem. When I run this script:

rm(list = ls(all = T))
gc()

library(Metrics)
library(glmnet)

nrows <- 92681
set.seed(456)
df1 <- data.frame(act1 = round(runif(nrows), 0), pred1 = runif(nrows))

Metrics::auc(df1$act1, df1$pred1)
glmnet::auc(df1$act1, df1$pred1)

I get:

> Metrics::auc(df1$act1, df1$pred1)
[1] 0.4930949
> glmnet::auc(df1$act1, df1$pred1)
[1] 0.4930949

When I add one more row and run this:

rm(list = ls(all = T))
gc()

library(Metrics)
library(glmnet)

nrows <- 92682
set.seed(456)
df1 <- data.frame(act1 = round(runif(nrows), 0), pred1 = runif(nrows))

Metrics::auc(df1$act1, df1$pred1)
glmnet::auc(df1$act1, df1$pred1)

I get:

> Metrics::auc(df1$act1, df1$pred1)
[1] NA
Warning message:
In n_pos * n_neg : NAs produced by integer overflow
> glmnet::auc(df1$act1, df1$pred1)
[1] 0.5011554

Any idea what's going on here?

screechOwl

2 Answers

4

Metrics::auc uses a formula with the value n_pos * n_neg in the denominator. Here that is 'sum(actual == 1) * sum(actual == 0)', and both sums are integers. Their product, 46308 * 46374 = 2147487192, exceeds the largest integer R can represent (.Machine$integer.max, which is 2147483647), so the multiplication overflows and returns NA.

For example:

46308 * 46374
#> [1] 2147487192

as.integer(46308) * as.integer(46374)
#> [1] NA
#> Warning message:
#> In as.integer(46308) * as.integer(46374) : NAs produced by integer overflow
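
One way to avoid the overflow, sketched here with the counts from the question, is to convert one operand to double before multiplying. The variable names are only illustrative and this is not the code inside Metrics::auc:

```r
# Counts from the question, stored as integers
# (sum() of a logical vector returns an integer)
n_pos <- 46308L
n_neg <- 46374L

.Machine$integer.max        # 2147483647
as.numeric(n_pos) * n_neg   # 2147487192, computed in double precision
```

Doubles represent whole numbers exactly up to 2^53, so the product of two row counts of this size fits comfortably.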
Jthorpe
  • What is the solution for this? – Arindam Ghosh Mar 22 '23 at 09:56
  • In short, this is a category of problem where the denominator is large if done the naive way, but there are lower memory ways of calculating it. I don't have time to code up a solution, but all you need to know is that the AUC is the mean of (a > b) for all values `a` among the cases and values `b` among the controls. You can partition the values among A and B and average the two. For example, you could use binary splits on your cases and controls, use the AUC method within those splits and then average the results appropriately (according to the size of the split). – Jthorpe Mar 22 '23 at 17:55
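
The pairwise definition in the comment above can be sketched on a small toy sample and compared with the rank-based formula. The function names `auc_pairwise` and `auc_rank` are hypothetical, the data is made up, and counting ties as half a win is an assumption added here so that the two versions agree when predictions tie:

```r
# Hypothetical helper: AUC as the mean of (a > b) over all case/control pairs
auc_pairwise <- function(actual, predicted) {
  pos <- predicted[actual == 1]  # scores of the cases
  neg <- predicted[actual == 0]  # scores of the controls
  # Compare every case with every control; a tie counts as half a win
  mean(outer(pos, neg, function(a, b) (a > b) + 0.5 * (a == b)))
}

# Hypothetical helper: rank-based formula with the counts held as doubles
auc_rank <- function(actual, predicted) {
  r <- rank(predicted)
  n_pos <- as.numeric(sum(actual == 1))
  n_neg <- as.numeric(sum(actual == 0))
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data, not the data from the question
set.seed(1)
actual <- rbinom(200, 1, 0.5)
predicted <- runif(200)
all.equal(auc_pairwise(actual, predicted), auc_rank(actual, predicted))
```

The pairwise version builds an n_pos by n_neg matrix, so it is only practical for small samples. For data the size of the question's, the rank-based form with double counts is likely the simpler route.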
0

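I just modified the function so that the counts are converted to numeric (double) before they are multiplied, which avoids the integer overflow. Note the argument order: predictions first, then labels, which is the reverse of Metrics::auc.

AUC <- function(y_pred, y_true) {
  # Ranks of the predictions (ties get average ranks)
  r <- rank(y_pred)
  # Hold the counts as doubles so n_pos * n_neg cannot overflow
  n_pos <- as.numeric(sum(y_true == 1))
  n_neg <- as.numeric(sum(y_true == 0))
  # Rank sum of the positives, shifted and scaled by the number of pairs
  (sum(r[y_true == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}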