
I'm having an issue where an R function (NbClust) crashes R, but at different points on different runs with the same data. According to journalctl, the crashes are all due to out-of-memory kills. For example:

Sep 04 02:00:56 Q35 kernel: [   7608]  1000  7608 11071962 10836497 87408640        0             0 rsession
Sep 04 02:00:56 Q35 kernel: Out of memory: Kill process 7608 (rsession) score 655 or sacrifice child
Sep 04 02:00:56 Q35 kernel: Killed process 7608 (rsession) total-vm:44287848kB, anon-rss:43345988kB, file-rss:0kB, shmem-rss:0kB
Sep 04 02:00:56 Q35 kernel: oom_reaper: reaped process 7608 (rsession), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I have been testing my code to figure out which lines are causing the memory errors, and it turns out that it varies, even using the same data. Aside from wanting to solve it, I am confused as to why this is an intermittent problem. If an object is too big to fit in memory, it should be a problem every time I run it given the same resources, right?

The amount of memory being used by other processes was not dramatically different between runs, and I always started from a clean environment. When I look at top I always have memory to spare (although I am rarely looking at the exact moment of the crash). I've tried reducing the memory load by removing unneeded objects and doing regular garbage collection, but this has had no discernible effect.
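Roughly what I mean by that (a minimal sketch; the object names are placeholders, not my real variables):

```r
## Drop large intermediates as soon as they are no longer needed, then ask R
## to run a garbage collection and report how much memory it thinks is in use.
rm(dist_matrix, scaled_data)   # hypothetical large intermediates
gc(verbose = TRUE)
```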

For example, when running NbClust, sometimes the crash occurs while running length(eigen(TT)$value); other times it happens during a call to hclust. Sometimes it doesn't crash at all and instead exits with a comparatively graceful "cannot allocate vector of size" error. Aside from any suggestions about reducing memory load, I want to know why I am running out of memory sometimes but not others.

Edit: After changing all uses of hclust to hclust.vector, I have not had any more crashes during the hierarchical clustering steps. However, there are still crashes happening at varying places (often during calls to eigen()). If I could reliably predict (within a margin of error) how much memory each line of my code was going to use, that would be great.
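For reference, the substitution looks roughly like this (a sketch using the fastcluster package; X and the linkage method are placeholders, not my exact call):

```r
library(fastcluster)

## stats::hclust (and fastcluster::hclust) need a full dissimilarity object
## first, which is O(n^2) memory before any clustering even starts:
## d  <- dist(X)
## hc <- hclust(d, method = "ward.D2")

## hclust.vector works on the data matrix directly and does not store the
## full distance matrix (supported for single, ward, centroid, median linkage):
hc <- hclust.vector(X, method = "ward", metric = "euclidean")
```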

Stonecraft
  • First question: Is the algorithm fully deterministic or is an RNG involved? Then, a quick look indicates that you are doing clustering and a distance measure is involved. Such approaches often don't scale well for large data. You probably need a smarter or memory optimized algorithm or do this with a subset of your data. – Roland Sep 05 '19 at 06:33
  • It is not deterministic, but I am setting the seed, so that should be the same every time, right? I am also editing the functions in question to use the fastcluster package, which I hope will be sufficient. But it's hard to effectively test something if the problem isn't happening consistently. – Stonecraft Sep 05 '19 at 07:26

1 Answer


Modern memory management is far less deterministic than you seem to think.

If you want more reproducible results, get rid of any garbage collection and any parallelism (in particular, garbage collection running in parallel with your program!), and make sure the process is limited to an amount of memory well below your free system memory.
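One way to impose such a limit from inside R (a sketch only; it assumes the `unix` package on Linux, and the 50 GB figure is just an example value):

```r
## Cap the process's address space well below free system RAM. Past the
## limit, allocations fail with a catchable "cannot allocate vector of
## size ..." error instead of inviting the kernel OOM killer.
library(unix)
rlimit_as(cur = 50 * 1024^3)   # ~50 GB soft limit, in bytes
```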

The kernel OOM killer is a measure of last resort, used when the kernel has overcommitted memory (you may want to read up on what that means), is completely out of swap space, and cannot fulfill its promises.

The kernel can allocate memory that doesn't need to physically exist until it is first accessed. Hence, the OOM killer can be triggered not at allocation time, but when a page is actually first used.
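You can check the kernel's overcommit policy and the current commit state from within R (a sketch; these are the standard Linux procfs paths):

```r
## 0 = heuristic overcommit (default), 1 = always overcommit, 2 = don't overcommit
readLines("/proc/sys/vm/overcommit_memory")

## Current totals: MemTotal/MemFree/MemAvailable and CommitLimit/Committed_AS
grep("^Mem|^Commit", readLines("/proc/meminfo"), value = TRUE)
```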

Has QUIT--Anony-Mousse
  • Why get rid of garbage collection? If it happens in the same place in the code, why does it have a different effect from run to run? Is there a good way to record the memory usage in the moments before the crash? If I could do that, in conjunction with Rprof(), maybe I could at least see what objects are causing these sudden memory spikes. – Stonecraft Sep 05 '19 at 18:30
  • GC usually is quite non-deterministic, asynchronous, parallel, etc. It does *not* happen in the same place. – Has QUIT--Anony-Mousse Sep 06 '19 at 00:39
  • Is there any way that I can at least correlate each line run by R with memory use as measured by my system? Because I'm not seeing anything in Rprof() that is using up anywhere near the max available. – Stonecraft Sep 06 '19 at 01:43
  • As mentioned above, the operating system will overcommit (unless you turn that off) because applications allocate memory but never use it. Hence it is only allocated by the kernel when it is actually first used. So no, you often won't just be able to see where it fails. Also, allocation does not happen in bytes, but in memory pages. Plus, your problems likely don't occur in R code, but in the underlying C and Fortran code. – Has QUIT--Anony-Mousse Sep 06 '19 at 06:01
  • Wouldn't knowing which call of underlying code is causing memory use to spike be a good start? – Stonecraft Sep 06 '19 at 07:40
  • Since the OOM killer is triggered by memory access, not by allocation ("spikes"), no. The kernel OOM killer can be triggered by other processes, and can kill processes other than the one requesting memory. So it could be a trivial `ls` in a shell that causes a kernel page allocation failure, and hence the OOM killer to kill your process, because that is most likely to make the system usable again. Memory management is much more complicated than a simple allocation failure... – Has QUIT--Anony-Mousse Sep 06 '19 at 17:50
  • Even if it is a simple ls() that triggers the actual failure, why wouldn't it help to know exactly which function acting on which object is causing such high memory use? – Stonecraft Sep 06 '19 at 23:12
  • Might it be possible to use the `bigmemory` package and adapt the functions in question to use it? – Stonecraft Sep 07 '19 at 00:58
  • I was talking about the Unix shell command `ls` as an example for a trivial external program that also briefly may allocate some memory, but is entirely unrelated to your program. There also is your desktop environment, maybe your web browser, ... – Has QUIT--Anony-Mousse Sep 07 '19 at 07:26
  • I doubt *you* can "just" rewrite hclust (which needs O(n²) memory, so this likely is where you run out of memory) to `bigmemory` because it's not R but some old Fortran or C function. However, there is no good reason to use hclust when using Mclust (or any other variant of GMM aka EM) on large data sets. Have you tried using other tools such as ELKI (it's called clustering.EM or so there). – Has QUIT--Anony-Mousse Sep 07 '19 at 07:32
  • No, I'm not trying to rewrite `hclust`. I'm using an alternate version of `hclust` that is available from the repos (`fastcluster`) and substituting that in the `NbClust` function's calls of hclust. As noted in my edit, substituting `hclust.vector` in place of hclust seems to have eliminated crashes at that step. Now the crashes are happening with `eigen`. Anyway, my matrix is 20k x 20k, so it's big but shouldn't be too big for the (minimum) 50 GB of RAM free for R at any time, should it? – Stonecraft Sep 07 '19 at 08:11
  • 20k × 20k × 8 bytes × 2 = 6.4 GB, so that should work unless you have already wasted a lot of memory in other places. An eigenvector decomposition of a matrix is O(d³) time and needs O(d²) memory. So the number of variables also matters. But it should work unless the code you use does some really weird things. Are you sure that is what you want to use? Maybe you have too many variables for GMM anyway?!? Have you tried alternatives such as EM from ELKI or sklearn? Maybe their code is better? Note that any "accidental" copy of the matrix is another 3.2 GB. If the code you use makes one copy per cluster... – Has QUIT--Anony-Mousse Sep 07 '19 at 10:04
  • Well, my thought is that it does do something weird, as memory use by the rsession process balloons to > 54 GB, and does so quite suddenly. Which is why I am so intent on figuring out exactly which function is running when memory use increases like this. I was thinking to try sklearn since it implements the Gap statistic and Python has better memory management. – Stonecraft Sep 08 '19 at 06:16
  • How many features do you have? How many clusters? Gap statistic may only work for kmeans and low-dimensional data, not GMM. I don't think sklearn is particularly good at memory management. If you run sklearn k-means on a sparse matrix, it will make a dense copy, and there doesn't seem to be anything you can do about this. – Has QUIT--Anony-Mousse Sep 08 '19 at 08:09
  • There are 20,000 observations of between 20-100 features. The 20,000 x 20,000 matrix is created as an intermediate when eigenstuff is going on. – Stonecraft Sep 08 '19 at 09:49
  • Mclust needs a fixed dimensionality of continuous variables, doesn't it? How can it be 20-100? – Has QUIT--Anony-Mousse Sep 08 '19 at 23:03
  • Sorry, I should have been clear that these are different matrices. All the matrices have 20,000 observations, but range between 20 and 100 features. – Stonecraft Sep 09 '19 at 02:45
  • Do they all fail the same way? Because it is of interest if the number of features matters. 100 with 20k samples likely already is too much to get reliable Gaussian models. But 20 features may still work, if they behave nicely. With discrete or binary features, weird things will happen. – Has QUIT--Anony-Mousse Sep 09 '19 at 05:47
  • They are binary features. The one with 100 features is 50 traits that are present or absent. But they always fail due to out-of-memory. I think part of the problem might be memory fragmentation. I've noticed that after a fresh restart of my VM, it usually fails at a later point/a matrix with more features. – Stonecraft Sep 09 '19 at 18:22
  • I'm marking this as answered, although most of the info I was looking for was in these comments. – Stonecraft Sep 09 '19 at 18:23
  • *Gaussian* mixture modeling assumes normal distributed data. That is the wrong model for binary features. – Has QUIT--Anony-Mousse Sep 09 '19 at 22:46
  • Sorry, I was super-unclear there, I got mixed up and was talking about two different steps of my pipeline at once. The data I am running the clustering on is continuous between -1 and +1, they are the coefficients of a linear model that had binary features as the dependent variable. – Stonecraft Sep 10 '19 at 04:41
  • I wouldn't expect clustering coefficients to cluster much, except around 0. So again, I'd choose a different modeling approach built around the null model of no correlation. – Has QUIT--Anony-Mousse Sep 10 '19 at 06:29
  • No, the linear coefficients are for membership in experimental batches. – Stonecraft Sep 10 '19 at 19:39
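Regarding the recurring question in these comments about tying a sudden memory spike to a specific call: one low-tech option is to read the process's resident set size from procfs around each step. A minimal sketch (Linux only; the helper names are made up for illustration, not an existing API):

```r
## Report the R process's resident set size (VmRSS) in GB, read from /proc.
rss_gb <- function() {
  line <- grep("^VmRSS:", readLines("/proc/self/status"), value = TRUE)
  as.numeric(gsub("[^0-9]", "", line)) / 1024^2   # VmRSS is reported in kB
}

## Evaluate an expression and log RSS before and after it.
with_rss_log <- function(label, expr) {
  before <- rss_gb()
  result <- expr                      # lazy evaluation: expr runs here
  message(sprintf("%s: %.2f GB -> %.2f GB RSS", label, before, rss_gb()))
  result
}

## Hypothetical usage around the suspect calls:
## ev <- with_rss_log("eigen(TT)", eigen(TT))
## hc <- with_rss_log("hclust.vector", hclust.vector(X, method = "ward"))
```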