How can I read a sparse matrix into R that I saved from Python as an .npz file? I have already come across two answers on Stack Overflow (*1 and *2 below), but neither seems to work in my case.

The data set was created with Python from a Pandas data frame via:

import scipy.sparse

# DataFrame is the Pandas data frame holding the data
scipy.sparse.save_npz(
    "data.npz",
    scipy.sparse.csr_matrix(DataFrame.values)
)
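
To make clear what this call writes, here is a minimal, self-contained sketch. The small array `values` is a made-up stand-in for `DataFrame.values`, and the file goes to a temporary directory instead of the working directory; both are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Hypothetical stand-in for DataFrame.values (mostly zeros).
values = np.array([[0.0, 1.5, 0.0],
                   [0.0, 0.0, 0.0],
                   [2.0, 0.0, 3.0]])

# Write to a temporary directory so an existing data.npz is not overwritten.
path = os.path.join(tempfile.mkdtemp(), "data.npz")
scipy.sparse.save_npz(path, scipy.sparse.csr_matrix(values))

# Reading the file back gives a sparse matrix in the same format,
# not a data frame: column names and the index are not stored.
loaded = scipy.sparse.load_npz(path)
print(loaded.format)
print(loaded.shape)
```

Only the numeric values and their positions survive the round trip, so any column names have to be carried over separately.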

It seems like the first steps for importing the data set in R are as follows.

library(reticulate)
np <- import("numpy")
npz1 <- np$load("data.npz")

However, this does not yield a data frame yet, nor any other object I can work with directly in R.

*1 Load sparce NumPy matrix into R

*2 Reading .npz files from R

Stücke
  • The dataframe was not saved. A sparse CSR representation of the numpy array from its `values` was written. The npz is a zip archive of 4 arrays. You can 'extract' them with an OS `zip` tool. – hpaulj Jun 27 '22 at 07:31
  • `scipy.sparse.load_npz(name)` will create a new `csr` matrix, not a numpy array or dataframe. – hpaulj Jun 27 '22 at 10:08
  • In Python I can load the data via `scipy.sparse.load_npz("data.npz")`. Can I also load it in R? – Stücke Jun 27 '22 at 12:02
  • Can you load those 4 arrays? Can you make a `csr` format sparse matrix from scratch? I'm not an R user so can't help you with that. – hpaulj Jun 27 '22 at 14:37
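
The approach suggested in these comments can be sketched in Python as follows. This is only an illustration under stated assumptions: the small matrix `original` and the temporary file path are made up, and the key names `data`, `indices`, `indptr` and `shape` are the ones `scipy.sparse.save_npz` uses for CSR matrices (the archive also carries a `format` entry).

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Hypothetical small matrix standing in for the real data set.
original = scipy.sparse.csr_matrix(np.array([[0, 4, 0],
                                             [5, 0, 6]]))
path = os.path.join(tempfile.mkdtemp(), "data.npz")
scipy.sparse.save_npz(path, original)

# np.load returns an NpzFile, a dict-like view of the arrays in the archive.
with np.load(path) as npz:
    names = sorted(npz.files)
    data = npz["data"]          # non-zero values
    indices = npz["indices"]    # column index of each value
    indptr = npz["indptr"]      # where each row starts in data/indices
    shape = tuple(int(n) for n in npz["shape"])

print(names)

# Build the CSR matrix from scratch out of its component arrays.
rebuilt = scipy.sparse.csr_matrix((data, indices, indptr), shape=shape)
print(rebuilt.toarray())
```

The same three arrays plus the shape are what any other tool would need in order to reconstruct the matrix without scipy.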

1 Answer

I cannot access your dataset, so I can only speak from experience. When I try to load a sparse CSR matrix with numpy, it does not work: the resulting object has class numpy.lib.npyio.NpzFile, which I can't use in R.

The way I found to import the matrix into an R object, as mentioned in one of the posts you linked, is to use scipy.sparse.

library(reticulate)
scipy_sparse <- import("scipy.sparse")
csr_matrix <- scipy_sparse$load_npz("path_to_your_file")

csr_matrix, which was a scipy.sparse.csr_matrix object in Python (Compressed Sparse Row matrix), is automatically converted into a dgRMatrix from the R package Matrix. Note that if you had used scipy.sparse.csc_matrix in Python, you would get a dgCMatrix (Compressed Sparse Column matrix) instead. The function that does the hard work of converting the Python object into something R can use is py_to_r.scipy.sparse.csr.csr_matrix, from the reticulate package.

If you want to convert the dgRMatrix into a data frame, you can simply use

df <- as.data.frame(as.matrix(csr_matrix))

although this creates a dense copy of the matrix, which might not be the best thing to do memory-wise if your dataset is big.
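
To give a rough feel for the memory cost, here is a small Python sketch that compares the size of a CSR matrix with the size of its dense equivalent. The 1,000 x 1,000 identity matrix is a made-up example, and the numbers describe NumPy/SciPy storage; R's dense matrices also store every cell, so the same trade-off applies there.

```python
import scipy.sparse

# Hypothetical example: 1,000 x 1,000 matrix with 1,000 non-zero entries.
sparse = scipy.sparse.eye(1000, format="csr")

# Bytes held by the three arrays that make up the CSR representation.
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

# Bytes a dense array of the same shape and dtype would need.
dense_bytes = sparse.shape[0] * sparse.shape[1] * sparse.dtype.itemsize

print(sparse_bytes, dense_bytes)
```

The sparse size grows with the number of non-zero entries, while the dense size grows with rows times columns, so the gap widens quickly for large, mostly empty data sets.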

I hope this helped!

NoIdea