
I have a data.frame of hospital data with roughly 11 million rows.

Columns: ID (chr), outcome (1|0), 20x ICD-10 codes (chr).
Rows: 10.6 million

I want to reshape the data into a tidy (long) format so I can model the diagnostic codes against the binary outcome.

I would normally use pivot_longer or the base R aggregate function, but the resulting data.frame is huge and my machine struggles for memory (32 GB RAM, Windows Server running the latest 64-bit R).

My current plan is to split the data.frame, run pivot_longer on each piece, and manually add columns so the resulting data.frames can be bound together afterwards, or else model each split separately (see the sketch below).
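For illustration, a minimal sketch of what I mean (the dx01/dx02 column names stand in for my real ICD-10 columns, and the chunk size is just an example):

library(dplyr)
library(tidyr)

# Toy stand-in for the hospital table
df <- data.frame(
  ID      = c("A", "B", "C", "D"),
  outcome = c(1, 0, 1, 0),
  dx01    = c("I10", "E11", "J45", "I25"),
  dx02    = c("N18", NA,   "I25", NA)
)

# Split the rows into chunks (here 2 rows each), pivot each chunk, bind the results
chunks <- split(df, (seq_len(nrow(df)) - 1) %/% 2)
long <- bind_rows(lapply(chunks, function(chunk)
  pivot_longer(chunk,
               cols = starts_with("dx"),
               names_to = "position",
               values_to = "icd10",
               values_drop_na = TRUE)
))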

Is there a method I am missing that would reduce the data size or achieve a similar objective?

1 Answer


Try using data.table::melt instead:

library(data.table)

# Example data: an ID column plus ten 0/1 columns
DF <- data.frame(ID = LETTERS, replicate(10, sample(0:1, 26, rep = TRUE)))
setDT(DF)                  # convert to a data.table by reference (no copy)
melt(DF, id.vars = "ID")   # reshape from wide to long

The data.table package provides a high-performance version of base R's data.frame, with a focus on speed and memory efficiency.
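Applied to data shaped like your question, the call might look like this sketch (I am assuming your ICD-10 columns share a common prefix such as dx; adjust the pattern to your real names):

library(data.table)

# Toy table shaped like the question: ID, outcome, ICD-10 code columns
DT <- data.table(
  ID      = c("A", "B"),
  outcome = c(1L, 0L),
  dx01    = c("I10", "E11"),
  dx02    = c("N18", NA)
)

# Keep ID and outcome fixed, stack all dx* columns, and drop missing codes
long <- melt(DT,
             id.vars       = c("ID", "outcome"),
             measure.vars  = patterns("^dx"),
             variable.name = "position",
             value.name    = "icd10",
             na.rm = TRUE)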

Please also see this related benchmark.

  • Thanks, will give it a try. I am interested to know if there is another way of handling large datasets, to create tidy data apart from reshaping? – JisL Mar 15 '22 at 13:36
  • 1
    @JisL it all depends on the context. If e.g. your modeling function requires a certain format regarding its input data there is no way getting around reshaping. – ismirsehregal Mar 15 '22 at 13:45
  • @JisL did it help? – ismirsehregal Mar 19 '22 at 08:47
  • I have tried this but unfortunately it still throws an error as the resulting DT has >6k columns. I am trying a split, melt, combine method at the moment. – JisL Mar 20 '22 at 06:10
  • For this library([disk.frame](https://github.com/DiskFrame/disk.frame)) or library([arrow](https://github.com/apache/arrow/tree/master/r)) might be of interest. Please check: [vignette("dataset", package = "arrow")](https://arrow.apache.org/docs/r/articles/dataset.html#processing-data-in-batches). – ismirsehregal Mar 20 '22 at 07:45
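For reference, the out-of-memory route suggested in the last comment could look roughly like this with arrow (the Parquet path and dx* column names are hypothetical; the data is written to disk once and then queried lazily):

library(arrow)
library(dplyr)

# One-off conversion of the in-memory table to a Parquet dataset on disk
# write_dataset(DF, "hospital_parquet")

ds <- open_dataset("hospital_parquet")   # data stays on disk, not in RAM

# Build the query lazily; only the selected/filtered rows are pulled into memory
res <- ds |>
  select(ID, outcome, starts_with("dx")) |>
  filter(outcome == 1) |>
  collect()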