
I need some help with the massive memory usage of the NbClust function. On my data, memory balloons to 56 GB, at which point R crashes with a fatal error. Using debug(), I was able to trace the crash to these lines:

            if (any(indice == 23) || (indice == 32)) {
                res[nc - min_nc + 1, 23] <- Index.sPlussMoins(cl1 = cl1, 
                    md = md)$gamma

Debugging Index.sPlussMoins revealed that the crash happens during a for loop. The iteration at which it crashes varies, and during the loop memory usage fluctuates between 41 and 57 GB (I have 64 GB total):

    for (k in 1:nwithin1) {
      s.plus <- s.plus + (colSums(outer(between.dist1, 
                                        within.dist1[k], ">")))
      s.moins <- s.moins + (colSums(outer(between.dist1, 
                                          within.dist1[k], "<")))
      print(s.moins)
    }

I'm guessing that the memory usage comes from the outer() function. Can I modify NbClust to be more memory efficient (perhaps using the bigmemory package)? At the very least, it would be nice to get R to exit the function with a "cannot allocate vector of size..." error instead of crashing. That way I would have an idea of how much more memory I need to handle the matrix causing the crash.
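One idea I'm considering (untested, and based only on my reading of the quoted loop, where within.dist1[k] is a scalar so each colSums() call is just a count) is to sort between.dist1 once and count with findInterval(), which avoids allocating a fresh outer() temporary on every iteration:

    # sketch: same counts as the loop above, without the outer() temporaries
    sb <- sort(between.dist1)                                  # sort once
    nb <- length(sb)
    n_le <- findInterval(within.dist1, sb)                     # between-distances <= each within-distance
    n_lt <- findInterval(within.dist1, sb, left.open = TRUE)   # between-distances <  each within-distance
    s.plus  <- sum(as.numeric(nb - n_le))                      # pairs with between > within
    s.moins <- sum(as.numeric(n_lt))                           # pairs with between < within

Ties (between == within) are counted in neither sum, which matches the strict > and < comparisons in the original loop.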

Edit: I created a minimal example with a matrix approximately the size of the one I am using, although now it crashes at a different point, when the hclust function is called:

set.seed(123)

cluster_means = sample(1:25, 10)
mlist = list()
for(cm in cluster_means){
  name = as.character(cm)
  m = data.frame(matrix(rnorm(60000*60,mean=cm,sd=runif(1, 0.5, 3.5)), 60000, 60))
  mlist[[name]] = m
}

test_data = do.call(cbind, mlist)  # 60000 rows x 600 columns

library(NbClust)
debug(NbClust)
nbc = NbClust(data = test_data, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 30, 
              method = "ward.D2", index = "alllong", alphaBeale = 0.1)
debug: hc <- hclust(md, method = "ward.D2")

It seems to crash before using up the available memory (according to my system monitor, 34 GB of the 64 GB total is in use when it crashes).
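As a stopgap for getting a clean error instead of a hard crash, I gather from ?Memory that R's vector heap can be capped when the session is started; how well this works seems to depend on the R version and platform, so this is only a sketch:

    # run from the shell, before starting R: cap the vector heap so an oversized
    # allocation raises "Error: cannot allocate vector of size ..." (or "vector
    # memory exhausted") instead of taking the whole session down -- see ?Memory
    R --max-vsize=50G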

So is there any way I can do this without sub-sampling down to manageably sized matrices? And if I did, how would I know how much memory I will need for a matrix of a given size? I would have thought my 64 GB would be enough.
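For a rough sense of scale (my own back-of-the-envelope estimate, not anything NbClust reports): a dist object for n observations stores n(n-1)/2 doubles, so for the 60,000-row test matrix the dissimilarity matrix alone is around 13 GiB, before hclust or any index code makes working copies:

    # rough lower bound: the dissimilarity matrix alone
    n <- 60000                       # rows in test_data
    bytes <- n * (n - 1) / 2 * 8     # n(n-1)/2 doubles at 8 bytes each
    bytes / 1024^3                   # ~13.4 GiB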

Edit: I tried altering NbClust to use fastcluster instead of the stats version. It didn't crash, but did exit with a memory error:

Browse[2]> 
exiting from: fastcluster::hclust(md, method = "ward.D2")
Error: cannot allocate vector of size 9.3 Gb
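If I'm reading the fastcluster documentation correctly, it also provides hclust.vector(), which runs its memory-saving routines (including a Ward variant) directly on the data matrix instead of on a precomputed dist object; whether its "ward" method gives the heights I'd expect from ward.D2, and whether the (still at least quadratic) runtime is acceptable, is something I would need to verify:

    library(fastcluster)
    # memory-saving routine that clusters the raw data matrix instead of a dist object;
    # see ?hclust.vector for how its "ward" method relates to ward.D / ward.D2
    hc <- fastcluster::hclust.vector(as.matrix(test_data), method = "ward",
                                     metric = "euclidean")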
Stonecraft
  • Great question! However, we need a reproducible example to see how exactly the memory explodes. Can you please post the whole script? https://stackoverflow.com/help/minimal-reproducible-example – Vitali Avagyan Aug 27 '19 at 23:50
  • OK, I'm making a minimal example using dummy data. I'm not sure about posting the entire NbClust script, it is thousands of lines long and can be read on GitHub. https://github.com/cran/NbClust/blob/master/R/NbClust.R – Stonecraft Aug 28 '19 at 18:20

2 Answers


If you check the source code of NbClust, you'll see that it is anything but optimized for speed or memory efficiency.

The crash you're reporting is not even during clustering - it's in the evaluation afterwards, specifically in the "Gamma, Gplus and Tau" index code. Disable these indices and you may get further, but most likely you'll just have the same problem again in another index. Maybe you can pick only a few indices to run, specifically indices that do not need a lot of memory?
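Something along these lines (an untested sketch: whether these particular indices are actually cheap, and whether NbClust still builds the full dissimilarity matrix internally for them, is something you would need to check in the source):

    library(NbClust)

    # request one (hopefully cheap) index per call instead of index = "alllong"
    cheap_indices <- c("ch", "db", "hartigan")

    best_k <- lapply(cheap_indices, function(ix) {
        NbClust(data = test_data, distance = "euclidean",
                min.nc = 2, max.nc = 30,
                method = "kmeans",   # avoids the O(n^2)-memory hclust step
                index = ix)$Best.nc
    })
    names(best_k) <- cheap_indices
    best_k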

Has QUIT--Anony-Mousse
  • Commenting out those lines did allow the current dataset to run to completion. I think I might have screwed up the structure of the output though. It's a pretty complicated function, so mucking about with it like this bothers me. – Stonecraft Aug 28 '19 at 17:18
  • But you are correct, with a larger matrix it crashed at a different point. – Stonecraft Aug 28 '19 at 17:55
  • 1
    You don't need to comment the lines and guess. Study the evaluation methods, and set the parameter to only include those that are O(n) not O(n²). – Has QUIT--Anony-Mousse Aug 29 '19 at 01:01
  • Sorry if this is an overly basic question, but how do I know? Can you point me to a relevant explanation of O-notation in the context of optimizing R functions? – Stonecraft Aug 29 '19 at 02:33
  • You'll need to check the relevant literature. I don't know which of these indices is based on pairwise distances. But it should be easy to see both in code and mathematical definition. If in doubt, don't use it - use only those that seem to be okay. – Has QUIT--Anony-Mousse Aug 29 '19 at 17:09
  • You also cannot use the initialization with hierarchical clustering (`hclust` is O(n²) memory and O(n³) time!), as this is both too slow and needs too much memory. There are alternatives - IIRC, the ELKI version of EM should run in O(nk), hence be more scalable. – Has QUIT--Anony-Mousse Aug 29 '19 at 17:10

I forked NbClust and made some changes that seem to let it run longer with bigger matrices before crashing. I changed some of the functions to use Rfast, propagate and fastcluster. However, there are still problems.

I haven't run all my data yet and have only run a few tests on dummy data with the gap index, so there is still time for it to fail. But any suggestions/criticisms would be welcome. My (in progress) fork of NbClust: https://github.com/jbhanks/NbClust

Stonecraft