
Problem:

I am trying to run a correlation test on a large dataset: the data.table fits in memory, but operating on it with Hmisc::rcorr() or corrr::correlate() eventually hits the memory limit.

> Error: cannot allocate vector of size 1.1 Gb

So I moved to the file-backed disk.frame package to solve this, but I still hit the memory limit.

Any advice on how to use disk.frame or another package dealing with big memory to achieve this is much appreciated.

Both rcorr() and correlate() take the whole dataset and operate on it at once. The dataset contains NA values, hence my need for these functions: they handle missing values via "pairwise.complete.obs".
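(For comparison, base R's stats::cor() also supports pairwise handling of NAs, though unlike rcorr() it returns only the coefficient matrix, with no p-values or pairwise counts, and it hits the same memory wall at this scale. A minimal illustration of the pairwise behaviour:)

```r
# Toy matrix with NAs; cor() drops incomplete pairs per column pair,
# not whole rows, when use = "pairwise.complete.obs"
m <- matrix(c(1, 2, NA, 4,
              2, 4, 6, 8,
              1, NA, 3, 4), ncol = 3)
cor(m, use = "pairwise.complete.obs", method = "pearson")
# Columns 1 and 2 are compared on rows 1, 2, 4 only, giving r = 1
```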

Attempts:

# Packages ----
library(corrr)
library(Hmisc)
library(disk.frame)
library(data.table)


# Initialise parallel processing backend
setup_disk.frame()

# Enable large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)


# test_DT is a data.table of ~18000 columns and ~800 rows
# of type `num` (`double`) 


# Create filebacked disk.frame ----
test_DT_df <- as.disk.frame(
  test_DT, 
  outdir = file.path(tempdir(), "test_tmp.df"),
  nchunks = recommend_nchunks(test_DT, conservatism = 4),
  overwrite = TRUE
)


# `Hmisc` correlation test by chunks ----
# DOES NOT WORK (memory limit issue)
test_cor <- write_disk.frame(
  cmap(
    .x = test_DT_df,
    .f = function(.x) {
      Hmisc::rcorr(
        x = as.matrix(.x),
        type = "pearson"
      )
    }
  ),
  overwrite = TRUE
)
# Bring into R (above code fails before this line is reached)
test_cor_collect <- disk.frame::collect(test_cor)


# `corrr` correlation test by chunks ----
# DOES NOT WORK (memory limit issue)
test_cor <- write_disk.frame(
  cmap(
    .x = test_DT_df,
    .f = function(.x) {
      corrr::correlate(
        x = .x,
        use = "pairwise.complete.obs",
        method = "pearson"
      )
    }
  ),
  overwrite = TRUE
)
# Bring into R (above code fails before this line is reached)
test_cor_collect <- disk.frame::collect(test_cor)


# Cleanup ----
delete(test_DT_df)
delete(test_cor)
rm(test_DT_df, test_cor, test_cor_collect)
gc()
Buzz B
  • What is the size of your data exactly? – F. Privé Nov 11 '21 at 13:45
  • The `data.table` object in memory is 94.2 MB. The `disk.frame` files in storage total 89.9 MB. I forced `Hmisc::rcorr()` to complete on the `data.table %>% as.matrix` by running `memory.limit(size = 56000)` beforehand (this took some time, with much disk thrashing); the resulting object in memory was 5.3 GB. – Buzz B Nov 12 '21 at 01:15
  • Yes, my question was to get a sense of the number of columns that you have, because the correlation matrix is the square of that. You should look at packages bigmemory and bigstatsr (disclaimer: I'm the author of bigstatsr) if you want to store some data on disk (here the result, I guess). Then you can loop over all the pairwise variables and store the result in the on-disk matrix. – F. Privé Nov 12 '21 at 08:02
  • Thank you. It is ~18000 columns and ~800 rows. I tried `bigmemory` and `bigstatsr`, but ran into difficulty ([bigmemory issue](https://stackoverflow.com/questions/69848972/how-to-perform-hmiscrcorr-with-big-memory-data-in-r)). I don't think it handles data with `NA` values (I need `"pairwise.complete.obs"`, so I can't remove all rows containing `NA` beforehand). I would appreciate an example of how to loop over the pairwise variables and store the result on disk as you describe. – Buzz B Nov 12 '21 at 09:49
  • Just loop over all column pairs (i, j) with i < j, compute the correlation between variable i and variable j, and store it in res_i,j and res_j,i. – F. Privé Nov 12 '21 at 15:07
  • Thank you, but I'm afraid I still find that difficult to follow. I'm assuming you mean 2 nested for loops (i, j), but I don't understand what is meant by i < j ? I don't understand why there are 2 objects being stored, and what goes into each? Is it possible to demonstrate the code in an answer, even if it's just the skeleton? – Buzz B Nov 13 '21 at 16:04

1 Answer


An answer to explain my comment "Then you can loop over all the pairwise variables and store the result in the on-disk matrix.":

# File-backed matrix (stored on disk) to hold the 4 x 4 result
res <- bigstatsr::FBM(4, 4)
# Loop over all pairs (i, j) with i < j and fill both triangles
for (j in seq_len(4)) {
  for (i in seq_len(j - 1)) {
    corr <- Hmisc::rcorr(iris[[j]], iris[[i]])
    res[i, j] <- res[j, i] <- corr$r[1, 2]
  }
  res[j, j] <- 1  # a variable is perfectly correlated with itself
}
res[]  # materialise the on-disk matrix as a standard R matrix
F. Privé
  • Many thanks @F. Privé. The `FBM` does not hold `dimnames`. How can I apply the `dimnames` to `res[]`? My intention is to create the output in long format. Is there a way to structure `res` so that the pairwise `i` and `j` exist in two columns with `corr$r` in a third column? If that's not possible, I'm thinking of using `melt()` on `as.data.table(res[])`, but I still need to apply the correct `dimnames` to `res[]` for that to work. – Buzz B Nov 17 '21 at 16:54
  • By long format, you mean storing i, j, and r_i,j? Just store it like that directly then. – F. Privé Nov 17 '21 at 18:27
  • Oh I see, thanks for your patience with my questions. Can I check: is this the correct interpretation of your previous comment: `res[i] <- res[j] <- corr$r[1, 2]`? Secondly, how could I apply `names` (indicating the columns compared) to `res[]` so that they are in the correct order? – Buzz B Nov 17 '21 at 23:57
  • I don't understand what you want exactly. And `res[]` is just a standard R matrix. – F. Privé Nov 18 '21 at 09:06
  • Essentially I'd like the output to be a `data.table`/`data.frame` with two columns corresponding to the `colnames` of the dataset (e.g. `iris`) being compared, followed by a column each for `corr$r`, `corr$P`, and `corr$n`. – Buzz B Nov 18 '21 at 15:25
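A possible sketch of the long format described in the last comment (not tested at the full ~18000-column scale; the column names `var_i`/`var_j` are illustrative). Storing the pair names directly in each row also sidesteps the missing `FBM` dimnames:

```r
# One row per variable pair (i < j), with the pair's column names and
# the r, P, and n values from Hmisc::rcorr()
df  <- iris[, 1:4]                # numeric columns only
nms <- names(df)
pairs <- t(combn(length(df), 2))  # all (i, j) pairs with i < j
res <- data.frame(
  var_i = nms[pairs[, 1]],
  var_j = nms[pairs[, 2]],
  r = NA_real_,
  P = NA_real_,
  n = NA_integer_
)
for (k in seq_len(nrow(pairs))) {
  corr <- Hmisc::rcorr(df[[pairs[k, 1]]], df[[pairs[k, 2]]])
  res$r[k] <- corr$r[1, 2]  # correlation coefficient
  res$P[k] <- corr$P[1, 2]  # p-value
  res$n[k] <- corr$n[1, 2]  # number of pairwise-complete observations
}
res
```

For very many pairs, `res` could be preallocated on disk (e.g. per block of pairs) rather than kept in memory, following the same FBM idea as in the answer.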