I need help with the massive memory usage of the NbClust function.
On my data, memory usage balloons to 56 GB, at which point R crashes with a fatal error. Using debug(), I was able to trace the error to these lines:
if (any(indice == 23) || (indice == 32)) {
    res[nc - min_nc + 1, 23] <- Index.sPlussMoins(cl1 = cl1,
        md = md)$gamma
Debugging Index.sPlussMoins revealed that the crash happens inside a for loop. The iteration at which it crashes varies, and during the loop memory usage fluctuates between 41 and 57 GB (I have 64 GB in total):
for (k in 1:nwithin1) {
    s.plus <- s.plus + (colSums(outer(between.dist1,
        within.dist1[k], ">")))
    s.moins <- s.moins + (colSums(outer(between.dist1,
        within.dist1[k], "<")))
    print(s.moins)
}
I'm guessing that the memory usage comes from the outer() function.
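For reference, the loop only needs two counts per within-distance: how many between-distances are larger and how many are smaller. Each outer() call appears to build temporary vectors as long as between.dist1, so one possible way to avoid that is to sort between.dist1 once and look the counts up with findInterval(). This is a minimal sketch, assuming between.dist1 and within.dist1 are plain numeric vectors; count_s_plus_moins is a hypothetical helper and not part of NbClust, and I have not tried it on vectors of the size in question.

```r
# Hypothetical helper (not part of NbClust): computes the same s.plus and
# s.moins totals as the loop, without one comparison vector per iteration.
count_s_plus_moins <- function(between.dist1, within.dist1) {
  sorted_between <- sort(between.dist1)  # one sorted copy, made once
  n_between <- length(sorted_between)
  # Number of between-distances <= each within-distance
  n_le <- findInterval(within.dist1, sorted_between)
  # Number of between-distances strictly < each within-distance
  n_lt <- findInterval(within.dist1, sorted_between, left.open = TRUE)
  list(
    s.plus  = sum(as.numeric(n_between - n_le)),  # between > within
    s.moins = sum(as.numeric(n_lt))               # between < within
  )
}

# Small check against the original loop (rounding forces some ties)
set.seed(1)
between.dist1 <- round(runif(200), 2)
within.dist1  <- round(runif(50), 2)
s.plus <- 0
s.moins <- 0
for (k in seq_along(within.dist1)) {
  s.plus  <- s.plus  + colSums(outer(between.dist1, within.dist1[k], ">"))
  s.moins <- s.moins + colSums(outer(between.dist1, within.dist1[k], "<"))
}
fast <- count_s_plus_moins(between.dist1, within.dist1)
stopifnot(fast$s.plus == s.plus, fast$s.moins == s.moins)
```

The sort still needs one extra copy of between.dist1, so this reduces the repeated allocations rather than the baseline memory held by the distance vectors themselves.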
Can I modify NbClust to be more memory efficient (perhaps using the bigmemory package)?
At the very least, it would be nice to get R to exit the function with a "cannot allocate vector of size..." error instead of crashing. That way I would have an idea of how much more memory I need to handle the matrix causing the crash.
Edit: I created a minimal example with a matrix approximately the size of the one I am using, although now it crashes at a different point, when the hclust function is called:
set.seed(123)
cluster_means = sample(1:25, 10)
mlist = list()
for (cm in cluster_means) {
    name = as.character(cm)
    m = data.frame(matrix(rnorm(60000*60, mean=cm, sd=runif(1, 0.5, 3.5)), 60000, 60))
    mlist[[name]] = m
}
test_data = do.call(cbind, cbind(mlist))
library(NbClust)
debug(fun = "NbClust")
nbc = NbClust(data = test_data, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 30,
              method = "ward.D2", index = "alllong", alphaBeale = 0.1)
debug: hc <- hclust(md, method = "ward.D2")
It seems to crash before using up the available memory: according to my system monitor, 34 GB out of 64 GB is in use when it crashes.
So is there any way I can do this without sub-sampling down to manageably sized matrices? And if I did sub-sample, how would I know how much memory I need for a matrix of a given size? I would have thought my 64 GB would be enough.
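As a rough lower bound on the memory question, the pairwise distance object for n rows holds n(n-1)/2 double-precision values at 8 bytes each. The sketch below does only that arithmetic; dist_size_gib is a hypothetical helper name, and the figure covers just the distance object, so any working copies made by hclust() or NbClust would presumably come on top of it.

```r
# Hypothetical helper: approximate size in GiB of the distance object
# for n rows, assuming n * (n - 1) / 2 doubles at 8 bytes each.
dist_size_gib <- function(n) {
  n_pairs <- as.numeric(n) * (n - 1) / 2  # as.numeric avoids integer overflow
  n_pairs * 8 / 1024^3
}

# Rows in the minimal example above
dist_size_gib(60000)
```

For the 60000-row example this comes to roughly 13.4 GiB for a single copy of the distances, so a few simultaneous copies would already approach 64 GB.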
Edit: I tried altering NbClust to use fastcluster instead of the stats version. It didn't crash, but did exit with a memory error:
Browse[2]>
exiting from: fastcluster::hclust(md, method = "ward.D2")
Error: cannot allocate vector of size 9.3 Gb