Problem:
I am trying to run a correlation test on a large dataset: the data.table itself fits in memory, but operating on it with Hmisc::rcorr() or corrr::correlate() eventually hits the memory limit:
> Error: cannot allocate vector of size 1.1 Gb
So I moved to the file-backed disk.frame package to work around this, but I am still hitting the memory limit. Any advice on how to use disk.frame, or another package designed for larger-than-memory data, to achieve this would be much appreciated.
Both rcorr() and correlate() take the whole dataset and operate on it at once. The dataset contains NA values, hence my need for these functions, since they handle missing values with "pairwise.complete.obs".
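For reference, these are the plain in-memory calls I am trying to scale up (a minimal sketch; both eventually run into the memory limit on the full test_DT):

# In-memory versions (fail on the full dataset) ----
# Hmisc::rcorr() handles missing values pairwise and returns a list
# with $r (correlations), $n (pairwise counts) and $P (p-values)
test_rcorr <- Hmisc::rcorr(as.matrix(test_DT), type = "pearson")

# corrr::correlate() returns the correlation matrix as a tibble (cor_df)
test_corrr <- corrr::correlate(test_DT, use = "pairwise.complete.obs", method = "pearson")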
Attempts:
# Packages ----
library(corrr)
library(Hmisc)
library(disk.frame)
library(data.table)
# Initialise parallel processing backend
setup_disk.frame()
# Enable large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)
# test_DT is a data.table of ~18000 columns and ~800 rows
# of type `num` (`double`)
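# test_DT itself is not shown; a dummy of roughly the same shape can be
# generated for a reproducible example (illustrative only; the values and
# the missing-value pattern are arbitrary)
set.seed(1)
test_DT <- as.data.table(matrix(rnorm(800 * 18000), nrow = 800))
test_DT[sample(.N, 50), V1 := NA_real_]  # add some NAs so pairwise handling matters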
# Create filebacked disk.frame ----
test_DT_df <- as.disk.frame(
  test_DT,
  outdir = file.path(tempdir(), "test_tmp.df"),
  nchunks = recommend_nchunks(test_DT, conservatism = 4),
  overwrite = TRUE
)
# `Hmisc` correlation test by chunks ----
# DOES NOT WORK (memory limit issue)
test_cor <- write_disk.frame(
  cmap(
    .x = test_DT_df,
    .f = function(.x) {
      Hmisc::rcorr(
        x = as.matrix(.x),
        type = "pearson"
      )
    }
  ),
  overwrite = TRUE
)
# Bring into R (above code fails before this line is reached)
test_cor_collect <- disk.frame::collect(test_cor)
# `corrr` correlation test by chunks ----
# DOES NOT WORK (memory limit issue)
test_cor <- write_disk.frame(
  cmap(
    .x = test_DT_df,
    .f = function(.x) {
      corrr::correlate(
        x = .x,
        use = "pairwise.complete.obs",
        method = "pearson"
      )
    }
  ),
  overwrite = TRUE
)
# Bring into R (above code fails before this line is reached)
test_cor_collect <- disk.frame::collect(test_cor)
# Cleanup ----
delete(test_DT_df)
delete(test_cor)
rm(test_DT_df, test_cor, test_cor_collect)
gc()