
I'm trying to convert a "big" factor into a set of indicator (i.e. dummy, binary, flag) variables in R as such:

FLN <- data.frame(nnet::class.ind(FinelineNumber))

where FinelineNumber is a 5,000-level factor from Kaggle.com's current Walmart contest (the data is public if you'd like to reproduce this error).

I keep getting this concerning-looking warning:

In n * (unclass(cl) - 1L) : NAs produced by integer overflow

Memory available to the system is essentially unlimited. I'm not sure what the problem is.

smci
Hack-R
  • How many rows does your data have... `FLN <- data.frame(class.ind(paste(1:5000, "a")))` runs without problem on my old lappie. – user20650 Dec 18 '15 at 20:10
  • Perhaps https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/sparse.model.matrix.html is useful. – user20650 Dec 18 '15 at 20:13
  • I was going to agree with @user20650. It's going to be hard for people on limited-memory systems to reproduce this. On my laptop the results of `z <- factor(rep(1:5000,n)); FLN <- data.frame(nnet::class.ind(z))` are, depending on `n`, either (1) fine; (2) obvious errors about the matrix being too large, or being out of memory; or (3) a crashed R session due to too-large memory requests. – Ben Bolker Dec 18 '15 at 20:15
  • Maybe [this](http://stackoverflow.com/a/17650923/324364) is most directly instructive on what is likely going on. My guess is that if the indexing vector in that function didn't force coercions to integer, it might go through as numeric, with R's newish (?) use of indexing via large double values. – joran Dec 18 '15 at 20:19
  • @user20650 it has about 650,000 rows. It's running on a server with 36 cores and 100GB free RAM. I will give the sparse matrix function a try; thanks – Hack-R Dec 18 '15 at 20:23
  • @joran Could be. The resulting data types are numeric. If it does turn into integer data at some point in the process, why would only 5,000 cause an overflow though? I suppose it would have to be related to the number of rows somehow? – Hack-R Dec 18 '15 at 20:24
  • Easily, because you're indexing the matrix, so it involves multiplying `5000L * 650000L`. – joran Dec 18 '15 at 20:25
  • @BenBolker Yea, it does require a lot of RAM; too much for a laptop unless maybe you have 16GB (to be fair, there's nothing I can do about that). So, I guess it boils down to there's too many rows? I wish I understood it better. Trying the sparse matrix solution now... – Hack-R Dec 18 '15 at 20:26
  • @joran Got it. I think that was the answer I was looking for. Thanks! – Hack-R Dec 18 '15 at 20:27

1 Answer


The source code of nnet::class.ind is:

function (cl) {
    n <- length(cl)
    cl <- as.factor(cl)
    x <- matrix(0, n, length(levels(cl)))
    x[(1L:n) + n * (unclass(cl) - 1L)] <- 1
    dimnames(x) <- list(names(cl), levels(cl))
    x
}

.Machine$integer.max is 2147483647. If n*(nlevels - 1L) is greater than this value, the integer multiplication overflows to NA, which produces your warning. Solving for n:

imax <- .Machine$integer.max
nlevels <- 5000
imax/(nlevels-1L)
## [1] 429582.6
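
The threshold can be checked directly without allocating any matrix. This is a small sketch that assumes the roughly 650,000 rows mentioned in the comments; the variable names are illustrative only:

```r
n <- 650000L        # approximate row count reported in the comments
max_code <- 4999L   # largest value of unclass(cl) - 1L for a 5,000-level factor

# Integer arithmetic: the product exceeds .Machine$integer.max, so R returns NA
# (normally with the "NAs produced by integer overflow" warning, suppressed here).
int_product <- suppressWarnings(n * max_code)
is.na(int_product)

# Double arithmetic: the same product is representable, so nothing is lost.
dbl_product <- as.numeric(n) * max_code
dbl_product > .Machine$integer.max

# Just below the threshold, the integer product is still fine.
below <- 429582L * max_code
is.na(below)
```

The first `is.na()` call should report `TRUE` and the last one `FALSE`, which matches the cutoff computed above.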

You'll encounter this problem if you have 429583 or more rows (not particularly big for a data-mining context). As commented above, you'll do much better with Matrix::sparse.model.matrix (or Matrix::fac2sparse), if your modeling framework can handle sparse matrices. Alternatively, you'll have to rewrite class.ind to avoid this bottleneck (i.e. indexing by rows and columns rather than by absolute location) [@joran comments above that R indexes large vectors via double-precision values, so you might be able to get away with just hacking that line to

x[(1:n) + n * (unclass(cl) - 1)] <- 1

possibly throwing in an explicit as.numeric() here or there to force the coercion to double ...]
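
A minimal sketch of the row-and-column rewrite suggested above, assuming the factor contains no NA values. `class_ind_rc` is a hypothetical name, not part of nnet, and this has only been reasoned through on small inputs:

```r
# Hypothetical variant of nnet::class.ind that indexes by (row, column) pairs
# instead of by absolute position, so no n * (level - 1) product is computed
# in integer arithmetic at the R level.
class_ind_rc <- function(cl) {
  cl <- as.factor(cl)
  n <- length(cl)
  x <- matrix(0, n, nlevels(cl))
  # Two-column index matrix: row i, column = integer code of the i-th level.
  # Assumes no NA levels; NAs would need to be dropped or handled first.
  x[cbind(seq_len(n), as.integer(cl))] <- 1
  dimnames(x) <- list(names(cl), levels(cl))
  x
}

# Toy example (made-up data, not the Kaggle set)
z <- factor(c("a", "b", "a", "c"))
m <- class_ind_rc(z)
m
```

Each index value stays within integer range as long as neither dimension exceeds .Machine$integer.max, so this should sidestep the overflow. It does nothing about the memory cost of the dense result.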

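Even if you were able to complete this step, you'd end up with a 650000*5000 matrix. It looks like that will be about 12Gb if stored as integers, and roughly twice that as the double matrix that class.ind actually creates via matrix(0, ...). The integer estimate: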

 print(650*object.size(matrix(1L,5000,1000)),units="Gb")

I guess if you've got 100Gb free that could be OK ...
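
For comparison, the sparse route mentioned above can be sketched as follows. This assumes the Matrix package is installed (it is a recommended package in standard R distributions) and uses toy data rather than FinelineNumber:

```r
library(Matrix)

# Toy stand-in for the 5,000-level factor (made-up data)
z <- factor(c("a", "b", "a", "c", "b"))

# fac2sparse() returns a levels-by-observations matrix, so transpose it
# to get one row per observation and one column per level.
ind <- t(fac2sparse(z))
dim(ind)

# The same indicator matrix built explicitly: one nonzero entry per row.
ind2 <- sparseMatrix(i = seq_along(z), j = as.integer(z), x = 1,
                     dims = c(length(z), nlevels(z)),
                     dimnames = list(NULL, levels(z)))
```

Either form stores only the one nonzero entry per row (plus index overhead) rather than every cell, which is why it scales to hundreds of thousands of rows where the dense version does not.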

Ben Bolker
  • Thanks much; good answer. I thought @user20650 was referring to the `fac2sparse` function in `Matrix` so I tried that instead of `sparse.model.matrix` and it also worked very well. – Hack-R Dec 18 '15 at 20:37