I have generated a large sparse matrix in Python in COO format, and it needs to be processed in R. The matrix contains more than 2^31-1 non-zero entries. I tried to save it as an .npz file and rebuild it in R.
The COO sparse matrix has a shape of (1119534, 239415) with 2 230 643 376 non-zero entries.
Code in R
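library(Matrix)      # provides sparseMatrix()
library(Rcpp)
library(reticulate)  # used to call numpy from R
np <- import("numpy")
npz <- np$load("LARGE_SPARSE_COO.npz")
# Pull the COO components into R as numeric vectors (indices are 0-based)
i <- as.numeric(npz$f[["row"]])
j <- as.numeric(npz$f[["col"]])
v <- as.numeric(npz$f[["data"]])
dims <- as.numeric(npz$f[["shape"]])
# index1=FALSE tells sparseMatrix() that i and j start at 0
X <- sparseMatrix(i, j, x=v, index1=FALSE, dims=dims)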
When the number of non-zero entries is below 2^31-1, the code above works, but when it exceeds 2^31-1 the following error occurs:
Error in py_ref_to_r(x) : negative length vectors are not allowed
Calls: as.vector ... py_to_r.numpy.ndarray -> NextMethod -> py_to_r.default -> py_ref_to_r
I think this is because the vector size exceeds the 32-bit limit. However, as far as I know R supports 64-bit vector lengths through long vectors. How can I load the row, col and data arrays from the .npz file as long vectors and pass them to sparseMatrix? Or is there another way to rebuild such a large sparse matrix in R?
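The 32-bit suspicion can be checked with plain arithmetic on the Python side. The sketch below is only an illustration: `wrap_int32` is a hypothetical helper, and it assumes the length is stored in a signed 32-bit integer somewhere during the conversion, which I have not confirmed in the reticulate source. It shows that the entry count does not fit in 32 bits and wraps to a negative number, which would be consistent with the "negative length" message.

```python
INT32_MAX = 2**31 - 1      # largest value a signed 32-bit integer can hold
nnz = 2_230_643_376        # non-zero entries reported for the matrix


def wrap_int32(n: int) -> int:
    """Hypothetical helper: reinterpret n as a signed 32-bit integer
    using two's complement wrap-around."""
    return (n + 2**31) % 2**32 - 2**31


exceeds_int32 = nnz > INT32_MAX   # does the count overflow 32 bits?
wrapped = wrap_int32(nnz)         # value a 32-bit length field would hold

print(exceeds_int32)  # True
print(wrapped)        # -2064323920
```

This only shows that the numbers are consistent with an overflow; it does not tell where in the conversion the overflow happens.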
I cannot reduce the size of the COO sparse matrix, and some of my matrices have even more non-zero entries. Any help/insight is appreciated.
Edit 1
I am aware of the spam/spam64 packages in R, but I have no idea how to use them in my case. I am also not sure whether the sparse matrix format from spam will be accepted by glmnet, to which the sparse matrix will ultimately be passed.