
I am trying to reshape a (very) long table into a wide (very sparse) table.

The dimensions:

dim(data)
[1] 16146436        3

If I attempt the standard dcast operation, it fails because it runs out of memory:

datac <- dcast(formula=gene ~ sample, value.var="Coverage", data=data)
Error: cannot allocate vector of size 23399.6 Gb

Any suggestions on either making dcast run, or on alternatives that are optimized for large, very sparse datasets?
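
One alternative the question hints at is to skip the dense wide table and build a sparse matrix directly from the long table. The following is a minimal sketch, assuming the three columns are named `gene`, `sample` and `Coverage` as in the `dcast` call above. The toy `data` frame and the names `genes`, `samples` and `wide` are made up for illustration. It uses `sparseMatrix()` from the Matrix package, which stores only the cells that are actually present.

```r
library(Matrix)

# Toy stand-in for the real long table (column names taken from the question).
data <- data.frame(
  gene     = c("g1", "g1", "g2", "g3"),
  sample   = c("s1", "s3", "s2", "s3"),
  Coverage = c(5, 2, 7, 1),
  stringsAsFactors = FALSE
)

# Map each gene and each sample to an integer index via factor levels.
genes   <- factor(data$gene)
samples <- factor(data$sample)

# Build the gene x sample table as a sparse matrix.
# Only the (gene, sample) pairs present in `data` are stored.
wide <- sparseMatrix(
  i        = as.integer(genes),
  j        = as.integer(samples),
  x        = data$Coverage,
  dims     = c(nlevels(genes), nlevels(samples)),
  dimnames = list(levels(genes), levels(samples))
)

wide["g1", "s3"]  # look up a single cell by name
```

Two differences from `dcast` are worth keeping in mind: missing gene/sample combinations read back as 0 rather than `NA`, and by default `sparseMatrix()` adds up the values of duplicated (gene, sample) pairs instead of applying an aggregation function.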

Cyrus Mohammadian
  • Try `library(data.table);` `dcast.data.table(formula=gene ~ sample, value.var="Coverage", data=data)` – Sathish Aug 16 '16 at 20:56
  • Unfortunately that has the same issue: `> dcast.data.table(formula=gene ~ sample, value.var="Coverage", data=data.table(data)) Error: cannot allocate vector of size 11699.8 Gb` – Asker Brejnrod Aug 16 '16 at 21:12
  • data.table is usually preferred for large data sets. If possible, try a divide-and-conquer approach by running dcast on subsets of the data. – Sathish Aug 16 '16 at 21:15
  • Save your data to disk, free the workspace, and then read it back using `fread`. You can control the number of rows with the `nrows` argument of `fread`. – Sathish Aug 16 '16 at 21:20
  • For example: `dcast.data.table(formula=gene ~ sample, value.var="Coverage", data= fread("filename", nrows = 1000L))` – Sathish Aug 16 '16 at 21:21
  • `fread` does not have the ability to specify the number of rows or a row interval. If possible, split the data into chunks and save them to disk, then read the chunks with `fread` and `dcast` them. A better approach would be to read chunks of data from a database. – Sathish Aug 16 '16 at 21:30
  • It might be helpful to specify in what ways that table is sparse. Suggest providing the first 20 lines of data. – IRTFM Aug 16 '16 at 21:33
  • Yeah, if it's sparse in the sense of having a ton of zeros in Coverage, I'd just filter out the zeros and find a way to work with the data in long format. Alternatively, some packages support sparse matrix data structures. – Frank Aug 16 '16 at 21:35
  • Use `data.table::tables()` to track the memory used by data.table objects. – Sathish Aug 16 '16 at 21:54
  • https://www.r-bloggers.com/casting-a-wide-and-sparse-matrix-in-r/ – Sathish Aug 16 '16 at 21:58
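
Frank's suggestion of filtering out the zeros and staying in long format could look roughly like the sketch below. It assumes the column names from the question. The toy table `dt` and the summary columns `n_covered` and `mean_coverage` are hypothetical, chosen only to show that per-gene statistics can be computed without ever materialising the wide table.

```r
library(data.table)

# Toy stand-in for the long table; explicit zeros included on purpose.
dt <- data.table(
  gene     = c("g1", "g1", "g2", "g3", "g3"),
  sample   = c("s1", "s3", "s2", "s1", "s3"),
  Coverage = c(5, 0, 7, 0, 1)
)

# Total number of samples, counted before any rows are dropped.
n_samples <- uniqueN(dt$sample)

# Drop the zero rows: they carry no information in long format.
nz <- dt[Coverage != 0]

# Per-gene summaries computed directly on the long table.
# Dividing by n_samples treats every absent (gene, sample) pair as zero coverage.
per_gene <- nz[, .(
  n_covered     = .N,
  mean_coverage = sum(Coverage) / n_samples
), by = gene]

per_gene
```

Whether this is enough depends on what the wide table was meant for. Anything expressible as a grouped aggregation fits this pattern, while operations that need true matrix algebra are probably better served by a sparse matrix structure.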

0 Answers