
I am trying to turn an .rds file into a .feather file for reading with Pandas in Python.

library(feather)

# Set working directory
data = readRDS("file.rds")
data_year = data[["1986"]]

# Try 1
write_feather(
  data_year,
  "data_year.feather"
  )

# Try 2
write_feather(
  as.data.frame(as.matrix(data_year)),
  "data_year.feather"
)

Try 1 returns `Error: 'x' must be a data frame`, and Try 2 does write a `.feather` file, but that file is 4.5 GB for a single year, whereas the original `.rds` file is only 0.055 GB for several years.

How can I turn the file into separate or non-separate *.feather files for each year whilst maintaining an adequate file size?


`data` looks like this: a list of sparse matrices keyed by year (screenshot omitted).

`data_year` looks like this: a sparse matrix of class `dgCMatrix` (screenshot omitted).

**Update**

I am open to any suggestions for making the data available for use in NumPy/Pandas whilst maintaining a modest file size!

    `data_year` has length 576M but is a sparse matrix of class `dgCMatrix`. When coerced to data.frame, it will become large, I'm not seeing a way to avoid it. – Rui Barradas Jan 22 '22 at 13:15
  • Thank you a lot for your comment! Would there be any other way to make the data available for use in NumPy/Pandas whilst keeping a modest file size? – Stücke Jan 23 '22 at 08:29
  • You can find your answer [here](https://stackoverflow.com/questions/40996175/loading-a-rds-file-in-pandas). – Erfan Ghasemi Jan 23 '22 at 09:15
  • @ErfanGhasemi Thank you for your comment. `pyreadr.read_r('file.rds')` returns `LibrdataError: The file contains an unrecognized object`. The answer by user mgalardini via your link returns a list vector with each item being an `RS4 object`. I have no idea what that is. Certainly not a pandas data frame. I cannot find an answer via the link you provided. – Stücke Jan 23 '22 at 11:30
  • Maybe instead of converting from a sparse matrix, you could see if there is a Python method to read the sparse-matrix format that `Matrix::writeMM` writes out. EDIT: you could try reading into Python with either https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.io.mmread.html or https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.hb_read.html, depending on which format you write out from R. – user20650 Jan 25 '22 at 14:06
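A minimal Python-side sketch of the Matrix Market route suggested in the last comment (the matrix and file path here are stand-ins, not the asker's data):

```python
import os
import tempfile

import numpy as np
from scipy import io as spio
from scipy import sparse

# Small sparse matrix standing in for one year of data
m = sparse.random(6, 6, density=0.2, format="coo", random_state=0)

# On the R side this would be: Matrix::writeMM(data_year, "data_year.mtx")
path = os.path.join(tempfile.mkdtemp(), "data_year.mtx")
spio.mmwrite(path, m)

# Read the Matrix Market file back as a scipy sparse matrix
m2 = spio.mmread(path)
print(np.allclose(m.toarray(), m2.toarray()))  # True
```

Matrix Market files are plain text and store only the non-zero entries, so the file on disk stays close in size to the sparse representation.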

2 Answers


Maybe something like the following function can be of help.

The function reshapes the sparse matrix to long format, dropping the zeros. This reduces both the final data.frame size and the size of the file on disk.

library(Matrix)
library(feather)

dgcMatrix_to_long_df <- function(x) {
  res <- NULL
  if(nrow(x) > 0L) {
    for(i in 1:nrow(x)){
      d <- as.matrix(x[i, , drop = FALSE])
      d <- as.data.frame(d)
      d$row <- i
      d <- tidyr::pivot_longer(d, cols = -row, names_to = "col")
      d <- d[d$value != 0,]
      res <- rbind(res, d)
    }
  }
  res
}

y <- dgcMatrix_to_long_df(data_year)
head(y)
## A tibble: 6 x 3
#    row col      value
#  <int> <chr>    <dbl>
#1     1 Col_0103    51
#2     1 Col_0149     6
#3     1 Col_0188     5
#4     1 Col_0238    89
#5     1 Col_0545    14
#6     1 Col_0547    58


path <- "my_data.feather"
write_feather(y, path)
z <- read_feather(path)
identical(y, z)
#[1] TRUE

# The file size is 232 KB though the initial matrix
# had 1 million elements stored as doubles, 
# for a total of 8 MB, a saving of around 97%
file.size(path)/1024
#[1] 232.0234
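On the Python side, the long table written above can be read with pandas and rebuilt into a sparse matrix without ever going dense. A minimal sketch (the tiny stand-in data frame mimics the `row`/`col`/`value` layout above; in practice you would start from `pd.read_feather("my_data.feather")`):

```python
import pandas as pd
from scipy import sparse

# Stand-in for: y = pd.read_feather("my_data.feather")
y = pd.DataFrame({
    "row":   [1, 1, 2],
    "col":   ["Col_0103", "Col_0149", "Col_0103"],
    "value": [51.0, 6.0, 5.0],
})

# Map the column labels to integer codes, then build a COO matrix
cols = pd.Categorical(y["col"])
m = sparse.coo_matrix(
    (y["value"], (y["row"] - 1, cols.codes)),  # R rows are 1-based
    shape=(int(y["row"].max()), len(cols.categories)),
).tocsc()

print(m.toarray())
# [[51.  6.]
#  [ 5.  0.]]
```

Note that only the columns that actually appear in the long table are recovered; if the original matrix had all-zero columns, pass the full column list to `pd.Categorical(..., categories=...)` instead.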

Edit

The following function is much faster.

dgcMatrix_to_long_df2 <- function(x) {
  res <- NULL
  if(nrow(x) > 0L) {
    for(i in 1:nrow(x)){
      d <- as.matrix(x[i, , drop = FALSE])
      inx <- which(d != 0, arr.ind = TRUE)
      d <- cbind(inx, value = c(d[d != 0]))
      d[, "row"] <- i
      res <- rbind(res, d)
    }
  }
  as.data.frame(res)
}

system.time(y <- dgcMatrix_to_long_df(data_year))
#   user  system elapsed 
#   7.89    0.04    7.92 
system.time(y <- dgcMatrix_to_long_df2(data_year))
#   user  system elapsed 
#   0.14    0.00    0.14

Test data

set.seed(2022)
n <- 1e3
x <- rep(0L, n*n)
inx <- sample(c(FALSE, TRUE), n*n, replace = TRUE, prob = c(0.99, 0.01))
x[inx] <- sample(100, sum(inx), replace = TRUE)
data_year <- Matrix(x, n, n, dimnames = list(NULL, sprintf("Col_%04d", 1:n)))
  • Again, thank you very much for your answer! I ran `test_data = readRDS("file.rds")` and then `test_year = dgcMatrix_to_long_df2(test_data[["1986"]])`. The result is a `data.frame` with the row and col index of non-zero values if I understand correctly. Unfortunately, this makes the import and "recreation" of the structure a bit tricky. I was hoping for a *simple* export/import option but I guess this simply does not exist then. ... – Stücke Jan 25 '22 at 13:29
  • ps `Matrix` has a summary method (which I think is what you are doing here), e.g. `summary(data_year)` – user20650 Jan 25 '22 at 14:16

With scipy and rpy2, you can read each dgCMatrix object directly into Python as a scipy.sparse.csc_matrix object. Both use compressed sparse column (CSC) format, so there is actually zero need for preprocessing. All you need to do is pass the attributes of the dgCMatrix object as arguments to the csc_matrix constructor.

To test it out, I used R to create an RDS file storing a list of dgCMatrix objects:

library("Matrix")
set.seed(1L)

d <- 6L
n <- 10L
l <- replicate(n, sparseMatrix(i = sample(d), j = sample(d), x = sample(d), repr = "C"), simplify = FALSE)
names(l) <- as.character(seq(1986L, length.out = n))

l[["1986"]]
## 6 x 6 sparse Matrix of class "dgCMatrix"
##                 
## [1,] . . 5 . . .
## [2,] 3 . . . . .
## [3,] . . . . . 6
## [4,] . 2 . . . .
## [5,] . . . . 1 .
## [6,] . . . 4 . .

saveRDS(l, file = "list_of_dgCMatrix.rds")

Then, in Python:

from scipy import sparse
from rpy2  import robjects
readRDS = robjects.r['readRDS']

l = readRDS('list_of_dgCMatrix.rds')
x = l.rx2('1986') # in R: l[["1986"]]
x
## <rpy2.robjects.methods.RS4 object at 0x120db7b00> [RTYPES.S4SXP]
## R classes: ('dgCMatrix',)

print(x)
## 6 x 6 sparse Matrix of class "dgCMatrix"
##                 
## [1,] . . 5 . . .
## [2,] 3 . . . . .
## [3,] . . . . . 6
## [4,] . 2 . . . .
## [5,] . . . . 1 .
## [6,] . . . 4 . .

data    = x.do_slot('x')   # in R: x@x
indices = x.do_slot('i')   # in R: x@i
indptr  = x.do_slot('p')   # in R: x@p
shape   = x.do_slot('Dim') # in R: x@Dim or dim(x)

y = sparse.csc_matrix((data, indices, indptr), tuple(shape))
y
## <6x6 sparse matrix of type '<class 'numpy.float64'>'
##         with 6 stored elements in Compressed Sparse Column format>

print(y)
##   (1, 0)       3.0
##   (3, 1)       2.0
##   (0, 2)       5.0
##   (5, 3)       4.0
##   (4, 4)       1.0
##   (2, 5)       6.0

Here, y is an object of class scipy.sparse.csc_matrix. You should not need to use the toarray method to coerce y to an array with dense storage. scipy.sparse implements all of the matrix operations that I can imagine needing. For example, here are the row and column sums of y:

y.sum(1) # in R: as.matrix(rowSums(x))
## matrix([[5.],
##         [3.],
##         [6.],
##         [2.],
##         [1.],
##         [4.]])

y.sum(0) # in R: t(as.matrix(colSums(x)))
## matrix([[3., 2., 5., 4., 1., 6.]])
  • Great, thank you! This is exactly what I need + the last step which you also mentioned `array = sparse.csc_matrix.toarray(y)`. Thanks! – Stücke Jan 27 '22 at 07:17
  • I was trying to express that `toarray` will consume a lot of memory given the size of your matrices. You might be able to do what you need without `toarray`, using the sparse matrix methods available in `scipy.sparse`. Of course, the decision is yours. Glad this helped. – Mikael Jagan Jan 27 '22 at 07:43