The source code of nnet::class.ind is:
function (cl) {
    n <- length(cl)
    cl <- as.factor(cl)
    x <- matrix(0, n, length(levels(cl)))
    x[(1L:n) + n * (unclass(cl) - 1L)] <- 1
    dimnames(x) <- list(names(cl), levels(cl))
    x
}
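For intuition, here's a tiny worked example of my own (not from the package docs) showing what that absolute-index assignment does: (1L:n) + n * (unclass(cl) - 1L) is a column-major position into x, and it is computed in integer arithmetic.

library(nnet)
cl <- factor(c("a", "c", "b", "a"))
class.ind(cl)
##      a b c
## [1,] 1 0 0
## [2,] 0 0 1
## [3,] 0 1 0
## [4,] 1 0 0
(1L:4L) + 4L * (as.integer(cl) - 1L)  ## the linear positions that get set to 1
## [1]  1 10  7  4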
.Machine$integer.max is 2147483647. If n*(nlevels - 1L) is greater than this value, the integer index arithmetic overflows, which should produce your error. Solving for n:
imax <- .Machine$integer.max
nlevels <- 5000
imax/(nlevels-1L)
## [1] 429582.6
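You can see the failure mode directly: with n = 650000 and 5000 levels (the sizes used elsewhere in this answer), the integer multiplication overflows to NA.

n <- 650000L
n * (5000L - 1L)
## [1] NA
## plus a warning: "NAs produced by integer overflow"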
You'll encounter this problem if you have 429583 or more rows (not particularly big for a data-mining context). As commented above, you'll do much better with Matrix::sparse.model.matrix (or Matrix::fac2sparse) if your modeling framework can handle sparse matrices; a sparse sketch is at the end of this answer. Alternatively, you'll have to rewrite class.ind to avoid this bottleneck (i.e. indexing by rows and columns rather than by absolute location). [@joran comments above that R indexes large vectors via double-precision values, so you might be able to get away with just hacking that line to

x[(1:n) + n * (unclass(cl) - 1)] <- 1

possibly throwing in an explicit as.numeric() here or there to force the coercion to double ...]
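A minimal sketch of that patched version (my untested adaptation of the nnet code, not an official fix), forcing the index computation into double precision:

class.ind2 <- function(cl) {
    cl <- as.factor(cl)
    n <- length(cl)
    x <- matrix(0, n, length(levels(cl)))
    ## as.numeric() pushes the index arithmetic into doubles, so the
    ## computed positions can exceed .Machine$integer.max without overflowing
    x[(1:n) + as.numeric(n) * (unclass(cl) - 1)] <- 1
    dimnames(x) <- list(names(cl), levels(cl))
    x
}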
Even if you were able to complete this step, you'd end up with a dense 650000 x 5000 double-precision matrix - it looks like that will be about 24 Gb.

print(650*object.size(matrix(1,5000,1000)),units="Gb")
I guess if you've got 100Gb free that could be OK ...
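For comparison, here's a sketch of the sparse route (assuming your downstream code accepts a dgCMatrix). Matrix::fac2sparse() returns the transposed indicator matrix, so one t() gives the same layout as class.ind at a tiny fraction of the memory:

library(Matrix)
## sizes assumed to match the question: 650000 rows, 5000 classes
cl <- factor(sample(5000, 650000, replace = TRUE))
xs <- t(fac2sparse(cl))   ## 650000 x 5000 sparse 0/1 indicator matrix
print(object.size(xs), units = "Mb")
## on the order of 10 Mb, versus ~24 Gb for the dense version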