
I am working with a large (500000 x 2000) matrix containing data that can take one of 4 values. Keeping it in a standard R data type is pushing the limits of my workstation.

Is there a data type in R that allows for more efficient memory usage by allocating only 2 bits to each one of these values? This would considerably improve the memory efficiency of my code.

Thanks

wanzo

1 Answer


Depends on what kind of analysis you are doing. Using the sparse matrix functions from the Matrix package (as Shinobi_Atobe suggested above) might help if your matrix is sparse, that is, contains "lots" of zero values. The simplest operational definition of "lots of zero values" is: try it out (i.e., convert your data to a sparse matrix class) and see if it helps.
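For example, a minimal sketch of the sparse route, assuming the data really is dominated by zeros (the object names and proportions below are made up for illustration):

    library(Matrix)

    # Small stand-in for the real 500000 x 2000 matrix, mostly zeros
    m  <- matrix(sample(0:3, 1e6, replace = TRUE, prob = c(0.9, 0.05, 0.03, 0.02)),
                 nrow = 1e4)
    sm <- Matrix(m, sparse = TRUE)   # compressed sparse storage

    object.size(m)    # dense storage
    object.size(sm)   # noticeably smaller only if zeros dominate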

You can also make sure that your data is stored as (a) integer [check out 1L vs 1] or (b) factor [which is, technically, integer], but not as character or double (i.e., non-integer numeric). Integer seems to be R's least memory-hungry data type; even truth values (TRUE vs FALSE) do not seem to occupy less memory than integers. (I'm not completely sure about that, having tried only a very simple comparison: object.size(rep(T, 100)) == object.size(rep(1L, 100)), but see ?storage.mode.)
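A quick, illustrative way to compare per-element storage, using vectors of a million values:

    object.size(rep(1, 1e6))      # double:  ~8 MB
    object.size(rep(1L, 1e6))     # integer: ~4 MB
    object.size(rep(TRUE, 1e6))   # logical: ~4 MB, same as integer
    object.size(factor(sample(1:4, 1e6, replace = TRUE)))   # factor: ~4 MB plus levels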

So converting your data to integer might help, at least a little. Note that calling as.integer() on a matrix drops its dim attribute and returns a plain vector, so it's a little trickier than a single call.
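One way to convert without losing the dimensions is via storage.mode (a small sketch):

    m <- matrix(c(0, 1, 2, 3), nrow = 2)   # stored as double by default
    storage.mode(m) <- "integer"           # converts in place, keeps the dim attribute
    # equivalent: v <- as.integer(m); dim(v) <- dim(m)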

Beyond that, the possibilities include increasing the memory available to R[*], dividing your matrix into sub-parts (if that does not ruin your analytic strategy; a list of smaller matrices can be more efficient than one big matrix for some purposes, so instead of a single 500000 x 2000 matrix you could have, say, a list of 100 5000 x 2000 matrices; a sketch follows the footnote below), and doing some parts of the analysis in another language from within R (e.g., Rcpp) or completely outside it (e.g., an external Python script).

[*] Increasing (or decreasing) the memory available to R processes
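A sketch of the splitting idea, assuming the full matrix is in an object called big and its row count divides evenly into the number of blocks:

    n_blocks <- 100
    block_id <- rep(seq_len(n_blocks), each = nrow(big) / n_blocks)
    blocks   <- lapply(split(seq_len(nrow(big)), block_id),
                       function(idx) big[idx, , drop = FALSE])
    # blocks is now a list of 100 matrices of 5000 rows each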

lebatsnok