
I have generated a large sparse matrix in Python in COO format, and it needs to be processed in R. The matrix contains more than 2^31-1 non-zero entries. I tried saving it as an .npz file and rebuilding it in R.

The COO sparse matrix has a shape of (1119534, 239415) with 2 230 643 376 non-zero entries.
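To make the size concrete, here is a small arithmetic sketch using only the numbers quoted above. The memory figure assumes 8 bytes per element for each of the three triplet arrays (row, col, data), which is an assumption about how they end up being stored, not something measured.

```python
# Figures taken from the question; everything else is plain arithmetic.
INT32_MAX = 2**31 - 1            # largest signed 32-bit integer

n_rows, n_cols = 1119534, 239415
nnz = 2_230_643_376              # number of non-zero entries

exceeds_int32 = nnz > INT32_MAX  # does the count overflow a 32-bit length?
excess = nnz - INT32_MAX         # by how many entries
density = nnz / (n_rows * n_cols)

# Assumption: row, col and data are each held as 8-byte values.
triplet_bytes = 3 * 8 * nnz

print(exceeds_int32, excess)
print(f"density = {density:.4%}")
print(f"triplets = {triplet_bytes / 1e9:.1f} GB")
```

So the entry count is over the signed 32-bit limit by roughly 83 million, even though fewer than 1% of the cells are non-zero.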

Code in R

library(Matrix)
library(Rcpp)
library(reticulate)

np <- import("numpy")
npz <- np$load("LARGE_SPARSE_COO.npz")

i = as.numeric(npz$f[["row"]])
j = as.numeric(npz$f[["col"]])
v = as.numeric(npz$f[["data"]])
dims = as.numeric(npz$f[["shape"]])

X <- sparseMatrix(i, j, x=v, index1=FALSE, dims=dims)

This works when the number of non-zero entries is below 2^31-1, but when it exceeds 2^31-1 the following error occurs:

Error in py_ref_to_r(x): negative length vectors are not allowed

Calls: as.vector ... py_to_r.numpy.ndarray -> NextMethod -> py_to_r.default -> py_ref_to_r

I think this happens because the vector length exceeds the 32-bit limit. However, as far as I know, R supports vectors with 64-bit lengths (long vectors). How can I load the row, col and data arrays from the .npz file as long vectors and pass them to sparseMatrix? Or is there another way to rebuild such a large sparse matrix in R?
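The "negative length" wording is consistent with that suspicion. The sketch below shows what happens when the entry count from the question is squeezed into a signed 32-bit integer. Whether this is exactly what happens inside py_ref_to_r is an assumption; the code only illustrates the wraparound itself.

```python
import ctypes

nnz = 2_230_643_376          # non-zero count from the question
INT32_MAX = 2**31 - 1

# Reinterpret the count as a signed 32-bit integer, the way a C `int` would hold it.
wrapped = ctypes.c_int32(nnz).value

# The same value obtained by explicit two's-complement arithmetic.
wrapped_manual = (nnz + 2**31) % 2**32 - 2**31

print(nnz > INT32_MAX)       # the count does not fit in 32 bits
print(wrapped, wrapped_manual)
```

Both expressions give the same negative number, which is the kind of value that would trigger a "negative length vectors are not allowed" check.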

I cannot reduce the size of the COO sparse matrix, and some of my matrices have even more non-zero entries. Any help/insight is appreciated.

Edit 1

I am aware of the spam/spam64 packages in R, but I do not know how to use them in my case. I am also not sure whether the sparse matrix format from spam is accepted by glmnet, to which the matrix will ultimately be passed.

Trenton McKinney
bingung
  • A crazy idea is to open the file in Python and serialise it to an rds file, using py2r or whatever – MDEWITT Sep 14 '19 at 01:06
  • I tried to use `numpy2ri` to convert the numpy array stored internally in the sparse matrix, i.e. `numpy2ri(X.data)`, but the same error appears – bingung Sep 15 '19 at 21:51
  • It looks like it is something in `reticulate` and type conversion. Importantly, is the version of Python you are using 64 bit? Lots of discussion on this topic here https://github.com/rstudio/reticulate/issues/323. Might be worth posting on the R Studio community too. Sorry to not be more help! – MDEWITT Sep 16 '19 at 00:55
  • @MDEWITT Thanks for your help! And yes, it is python 3.7 running on 64-bit system – bingung Sep 16 '19 at 21:13
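One possible reading of the serialisation idea in the comments is to bypass the reticulate conversion and write each triplet array to a flat binary file from Python. The sketch below does this with the standard library on tiny stand-in data. The function names, file names and the choice of 64-bit integers and doubles are all hypothetical; for real NumPy arrays, `ndarray.tofile` plays the same role. Whether the R side can then read and assemble more than 2^31-1 entries is not settled by this sketch.

```python
import array
import os
import tempfile

def dump_triplets(rows, cols, vals, out_dir):
    """Write COO triplets as three flat binary files in native byte order."""
    paths = {}
    for name, typecode, values in (
        ("row", "q", rows),    # 64-bit signed integers
        ("col", "q", cols),
        ("data", "d", vals),   # 64-bit floats
    ):
        path = os.path.join(out_dir, name + ".bin")
        with open(path, "wb") as fh:
            array.array(typecode, values).tofile(fh)
        paths[name] = path
    return paths

def load_flat(path, typecode):
    """Read a flat binary file back into a list, to check the round trip."""
    buf = array.array(typecode)
    with open(path, "rb") as fh:
        buf.frombytes(fh.read())
    return buf.tolist()

# Tiny stand-in data; the real arrays would come from the COO matrix.
with tempfile.TemporaryDirectory() as tmp:
    paths = dump_triplets([0, 1, 2], [2, 0, 1], [1.5, -2.0, 3.25], tmp)
    rows_back = load_flat(paths["row"], "q")
    vals_back = load_flat(paths["data"], "d")
    row_file_bytes = os.path.getsize(paths["row"])

print(rows_back, vals_back, row_file_bytes)
```

Each file is just the raw values back to back, so its size is the element count times 8 bytes, with no header to parse.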

0 Answers