Suppose I have a matrix bigm. I need to use a random subset of this matrix and pass it to a machine learning algorithm such as, say, svm. The random subset will only be known at runtime. Additionally, there are other parameters that are also chosen from a grid.
So, I have code that looks something like this:
foo <- function(bigm, inTrain, moreParamsList) {
  ## combine the training subset with the other tuning parameters
  parsList <- c(list(data = bigm[inTrain, ]), moreParamsList)
  ## hand everything to svm in a single call
  do.call(svm, parsList)
}
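The "other parameters" mentioned above come from a tuning grid built roughly like this (the specific parameter names and values here are placeholders for my real grid):

## hypothetical tuning grid; one row supplies moreParamsList for a single call to foo()
paramGrid <- expand.grid(kernel = c("linear", "radial"),
                         cost   = c(1, 10, 100),
                         stringsAsFactors = FALSE)
moreParamsList <- as.list(paramGrid[1, ])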
What I am seeking to know is whether R allocates new memory to store that bigm[inTrain, ] object in parsList. (My guess is that it does.) What commands can I use to test such hypotheses myself? Additionally, is there a way of using a sub-matrix in R without allocating new memory?
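For reference, the kind of check I have been attempting looks like the following; I am not sure gc(), object.size(), and tracemem() are the right tools here, which is part of what I am asking:

## toy stand-in for my real matrix
bigm    <- matrix(rnorm(1e6), nrow = 1000)
inTrain <- sample(nrow(bigm), 500)

tracemem(bigm)                          # report if bigm itself is ever duplicated (needs R built with memory profiling)
gc(reset = TRUE)                        # reset the "max used" counters
sub <- bigm[inTrain, ]                  # does this allocate a full new copy?
gc()                                    # compare "max used" before and after
print(object.size(sub), units = "Mb")   # memory taken by the subset object itself
untracemem(bigm)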
Edit:
Also, assume I am calling foo using mclapply (on Linux), where bigm resides in the parent process. Does that mean I am making mc.cores copies of bigm, or do all cores just use the object from the parent?
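Concretely, the call pattern I have in mind is roughly the following, with foo as sketched above and paramGrid/trainSets standing in for my actual grid and resampling indices:

library(parallel)

## placeholders for the per-fit training indices
trainSets <- replicate(nrow(paramGrid),
                       sample(nrow(bigm), floor(0.7 * nrow(bigm))),
                       simplify = FALSE)

fits <- mclapply(seq_len(nrow(paramGrid)), function(i) {
  foo(bigm, trainSets[[i]], as.list(paramGrid[i, ]))   # bigm comes from the parent
}, mc.cores = 4)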
Are there any functions or heuristics for tracking the memory location and consumption of the objects being created in the different cores?
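The only heuristic I have come up with so far is to have each child read its own RSS out of /proc (Linux-specific), though I am not sure how meaningful that is, since RSS also counts pages shared with the parent:

## Linux-only guess at a per-child memory check
childRSS <- function() {
  status <- readLines(sprintf("/proc/%d/status", Sys.getpid()))
  grep("^VmRSS", status, value = TRUE)     # resident set size, shared pages included
}

mem <- mclapply(1:4, function(i) {
  sub <- bigm[sample(nrow(bigm), 100), ]   # force the child to allocate something
  c(pid = Sys.getpid(), rss = childRSS())
}, mc.cores = 4)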
Thanks.