
I have a function like the following:

fxn <- function(X) {
    data <- replicate(10, rnorm(10000))
    clusters <- kmeans(data, X)
    write.csv(clusters$cluster, paste0("kmeans", X, ".csv"))
}

I want to use mclapply to iterate it in parallel.

list <- list(10, 50, 100, 150, 200, 250, 300)
mclapply(list, fxn, mc.cores = 8)

This is a very simplified version of my function and use-case, but I want to use it to clarify how environments are handled when using a user-defined function and mclapply.

Because these calls run in parallel on the same machine, I was wondering whether mclapply could get "confused" at some point and mix up `data` or `clusters` between runs with different parameters from `list` (i.e., one call overwriting `data` or `clusters` with values computed from the wrong `X`). I am aware that each function call maintains its own environment, but since the same function is being run several times at once, I want to confirm how this works.
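To make the concern concrete, here is a minimal sketch (with a hypothetical helper name, `check_isolation`) of the kind of test I have in mind: each forked worker assigns a local `data` derived from its own `X`, sleeps so the workers overlap in time, and returns both, so any cross-contamination between workers would show up as a mismatch.

```r
library(parallel)

# Hypothetical helper for illustration: each forked worker gets its own
# copy of local variables, so `data` assigned in one worker should never
# leak into another.
check_isolation <- function(X) {
    data <- X * 100          # local to this worker's forked process
    Sys.sleep(runif(1))      # let the workers overlap in time
    list(input = X, data = data, pid = Sys.getpid())
}

res <- mclapply(1:4, check_isolation, mc.cores = 4)
# If environments are isolated, each result's `data` equals its own
# `input * 100`, regardless of what the other workers did concurrently.
```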

I would really appreciate it if you could clarify this for me or point me in the right direction.

Thanks!

Keshav M
Jack Arnestad
    Q: `mclapply function could get "confused" at some point`? A: No. The environment is cloned for each parallel worker - each entry in `list` is passed to a parallel worker. *N* (core)-copies of `data` & `clusters` are generated. – CPak Mar 14 '18 at 20:42
  • Does this mean that I can add `rm(data); gc()` and it would not cause any complications (I know for this particular example it doesn't, but I am wondering if that is true for all cases). – Jack Arnestad Mar 17 '18 at 01:09
  • I can't say for *all* cases. But yes, you can `rm(data); gc()` in each parallel worker - but keep in mind, when each parallel worker is finished, it will attempt to clean up the environment itself, so I don't think you need to explicitly do this. – CPak Mar 17 '18 at 16:25
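Putting the cleanup discussed in the comments into the original function, a hedged sketch (the name `fxn_cleanup` is made up for illustration) might look like this. Because `fork`-based workers each have their own address space (copy-on-write), `rm()` and `gc()` inside one worker cannot affect the others; and since the fork's memory is reclaimed when it exits, the explicit cleanup is optional, as CPak notes.

```r
library(parallel)

# Hypothetical variant of fxn with explicit per-worker cleanup.
fxn_cleanup <- function(X) {
    data <- replicate(10, rnorm(10000))
    clusters <- kmeans(data, X)
    write.csv(clusters$cluster, paste0("kmeans", X, ".csv"))
    rm(data, clusters)  # optional: the forked process is reclaimed on exit anyway
    gc()
    invisible(NULL)
}

mclapply(list(10, 50, 100), fxn_cleanup, mc.cores = 3)
```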

0 Answers