
I ran into an issue trying to use %dopar% and foreach() together with an R6 class. Searching around, I could only find two resources related to this: an unanswered SO question and an open GitHub issue on the R6 repository.

In one comment on the GitHub issue, a workaround is suggested: reassign the parent_env of the class with SomeClass$parent_env <- environment(). I would like to understand what exactly environment() refers to when this expression (i.e., SomeClass$parent_env <- environment()) is evaluated inside the %dopar% block of foreach().

Here is a minimal reproducible example:

Work <- R6::R6Class("Work",

    public = list(
        values = NULL,


        initialize = function() {
            self$values <- "some values"
        }
    )
)

Now, the following Task class uses the Work class in the constructor.

Task <- R6::R6Class("Task",
    private = list(
        ..work = NULL
    ),


    public = list(
        initialize = function(time) {
            private$..work <- Work$new()
            Sys.sleep(time)
        }
    ),


    active = list(
        work = function() {
            return(private$..work)
        }
    )
)

In the Factory class, Task objects are created, and the foreach() loop is implemented in ..m.thread().

Factory <- R6::R6Class("Factory",

    private = list(
        ..warehouse = list(),
        ..amount = NULL,
        ..parallel = NULL,


        ..m.thread = function(object, ...) {
            cluster <- parallel::makeCluster(parallel::detectCores() -  1)
            doParallel::registerDoParallel(cluster)

            private$..warehouse <- foreach::foreach(1:private$..amount, .export = c("Work")) %dopar% {
                # What exactly does `environment()` encapsulate in this context?
                object$parent_env <- environment()
                object$new(...) 
            }

            parallel::stopCluster(cluster)
        },


        ..s.thread = function(object, ...) {
            for (i in 1:private$..amount) {
               private$..warehouse[[i]] <- object$new(...)
            }
        },


        ..run = function(object, ...) {
            if(private$..parallel) {
                private$..m.thread(object, ...)
            } else {
                private$..s.thread(object, ...)
            }
        }
    ),


    public = list(
        initialize = function(object, ..., amount = 10, parallel = FALSE) {
            private$..amount = amount
            private$..parallel = parallel

            private$..run(object, ...)
        }
    ),


    active = list(
        warehouse = function() {
            return(private$..warehouse)
        }
    )
)

Then, it is called as:

library(foreach)

x = Factory$new(Task, time = 2, amount = 10, parallel = TRUE)

Without the line object$parent_env <- environment(), it throws the error mentioned in the two links above: Error in { : task 1 failed - "object 'Work' not found".

I would like to know: (1) what are some potential pitfalls of assigning parent_env inside foreach(), and (2) why does it work in the first place?


Update 1:

  • I returned environment() from within foreach(), so that private$..warehouse captures those environments
  • Using rlang::env_print() in a debug session (a browser() statement placed right after foreach() finished executing), here is what they consist of:
Browse[1]> env_print(private$..warehouse[[1]])

# <environment: 000000001A8332F0>
# parent: <environment: global>
# bindings:
#  * Work: <S3: R6ClassGenerator>
#  * ...: <...>

Browse[1]> env_print(environment())

# <environment: 000000001AC0F890>
# parent: <environment: 000000001AC20AF0>
# bindings:
#  * private: <env>
#  * cluster: <S3: SOCKcluster>
#  * ...: <...>

Browse[1]> env_print(parent.env(environment()))

# <environment: 000000001AC20AF0>
# parent: <environment: global>
# bindings:
#  * private: <env>
#  * self: <S3: Factory>

Browse[1]> env_print(parent.env(parent.env(environment())))

# <environment: global>
# parent: <environment: package:rlang>
# bindings:
#  * Work: <S3: R6ClassGenerator>
#  * .Random.seed: <int>
#  * Factory: <S3: R6ClassGenerator>
#  * Task: <S3: R6ClassGenerator>
Mihai
  • I've had bad luck trying to get objects with environments to be usable across the nodes of a `parallel` cluster. R6 objects are [inherently environments](https://r6.r-lib.org/articles/Introduction.html), which are often used to accomplish *pass-by-reference* semantics (instead of R's default *pass-by-value*). In order to do that, the `environment` is modified in place. Unfortunately, this env is not shared across cluster nodes, so even if an `environment` can be transferred to other nodes, the premise of the object is often lost. (I don't know that the env can be transferred, btw.) – r2evans Aug 04 '19 at 21:06
  • Reading that github issue, it is entirely possible I'm missing something ... – r2evans Aug 04 '19 at 21:09
  • I thought the same until I read the GitHub issue! Now I think it is possible, at least the `object$parent_env <- environment()` makes it possible. Still, I don't get the reason behind... – Mihai Aug 04 '19 at 21:14
  • Reasoning about it, I expected that if I include `self` in the `.export` this would also work, but it didn't: `object$parent_env <- parent.env(self$.__enclos_env__)`. – Mihai Aug 04 '19 at 21:19
  • Could you please make your example more minimal? – F. Privé Aug 05 '19 at 05:31
  • Hi @F.Privé, sure, [here is a GitHub Gist containing a more minimal example](https://gist.github.com/mihaiconstantin/df755017643ff08d8ad32d51db42cf59). Does it help? – Mihai Aug 05 '19 at 06:44

1 Answer


Disclaimer: a lot of what I say here is educated guessing and inference based on what I know; I can't guarantee everything is 100% correct.

I think there can be many pitfalls, and which ones apply really depends on what you do. I think your second question is more important, because if you understand that, you'll be able to evaluate some of the pitfalls by yourself.

The topic is rather complex, but you can probably start by reading about R's lexical scoping. In essence, R has a sort of hierarchy of environments, and when R code is executed, variables whose values are not found in the current environment (which is what environment() returns) are sought in the parent environments (not to be confused with the caller environments).
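
As a rough illustration (a minimal sketch with made-up names, not taken from your code), compare the environment of a particular function call with the environments R searches when a name is not bound locally:

x <- 1  # bound in the global environment

f <- function() {
    # environment() is the evaluation environment of this particular call
    print(environment())
    # `x` is not bound in that environment, so R walks up the chain of
    # parent environments (here: f's enclosing environment, the global
    # environment) until it finds a binding for `x`
    x + 1
}

f()  # prints the call environment and returns 2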

Based on the GitHub issue you linked, R6 generators save a "reference" to their parent environments, and they expect that everything their classes may need can be found in said parent or somewhere along the environment hierarchy, starting at that parent and going "up".
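
To make that concrete, here is a minimal sketch (with hypothetical class names, outside the parallel setting) of a generator relying on its parent_env to find another generator:

Outer <- R6::R6Class("Outer",
    public = list(
        inner = NULL,

        initialize = function() {
            # `Helper` is not defined inside the class; it is looked up
            # through the generator's parent_env when new() is called
            self$inner <- Helper$new()
        }
    )
)

# the generator recorded the environment in which R6Class() was called
identical(Outer$parent_env, globalenv())  # TRUE when defined at top level

Helper <- R6::R6Class("Helper")  # defining it afterwards is fine, lookup is lazy
obj <- Outer$new()               # works: `Helper` is found via parent_env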

The reason the workaround you're using works is that you're replacing the generator's parent environment with the one in the current foreach call inside the parallel worker (which may be a different R process, not necessarily a different thread), and, given that your .export specification probably exports the necessary values, R's lexical scoping can then search for the missing values starting from the foreach call in the separate thread/process.
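
In other words, a sketch of the same ..m.thread body, with a hypothetical ls() check added just to show what lives in that environment on the worker (consistent with your env_print() output above):

private$..warehouse <- foreach::foreach(1:private$..amount, .export = c("Work")) %dopar% {
    ls(environment())                   # should list "Work", placed there by .export
    object$parent_env <- environment()  # lookup for `Work` now starts here
    object$new(...)
}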

For the specific example you linked, I found that a simpler way to make it work (at least on my Linux machine) is to do the following:

library(doParallel)

cluster <- parallel::makeCluster(parallel::detectCores() -  1)
doParallel::registerDoParallel(cluster)
parallel::clusterExport(cluster, setdiff(ls(), "cluster"))

x = Factory$new(Task, time = 1, amount = 3)

but leaving the ..m.thread function as:

..m.thread = function(object, amount, ...) {
    private$..warehouse <- foreach::foreach(1:amount) %dopar% {
        object$new(...) 
    }
}

(and manually call stopCluster when done).

The clusterExport call should have semantics similar to*: take everything from the main R process' global environment except cluster, and make it available in each parallel worker's global environment. That way, any code inside the foreach call can use the generators when lexical scoping reaches their respective global environments. foreach can be clever and export some variables automatically (as shown in the GitHub issue), but it has limitations, and the hierarchy used during lexical scoping can get very messy.

*I say "similar to" because I don't know what exactly R does to distinguish (global) environments if forks are used, but since that export is needed, I assume they are indeed independent of each other.
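
A small self-contained sketch of the clusterExport semantics described above (hypothetical, just to check what the workers can see):

library(parallel)

cl <- makeCluster(2)
Work <- R6::R6Class("Work", public = list(values = "some values"))

# copy every object in the master's global environment except `cl`
# into each worker's global environment
clusterExport(cl, setdiff(ls(), "cl"))

# `Work` is now visible on every worker via its global environment
clusterEvalQ(cl, exists("Work"))  # list(TRUE, TRUE)

stopCluster(cl)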

PS: I'd use a call to on.exit(parallel::stopCluster(cluster)) if you create workers inside a function call; that way, if an error occurs, you avoid leaving processes around until they are somehow stopped.
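
For example, a hypothetical standalone variant of ..m.thread (named make_objects here), just to show where on.exit() would go:

library(doParallel)

make_objects <- function(object, amount, ...) {
    cluster <- parallel::makeCluster(2)
    # the workers get stopped even if foreach() throws an error
    on.exit(parallel::stopCluster(cluster), add = TRUE)
    doParallel::registerDoParallel(cluster)

    foreach::foreach(i = seq_len(amount)) %dopar% {
        object$new(...)
    }
}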

Alexis
  • Hi @Alexis, thank you for such an elaborate answer and the resource on lexical scoping. It really helps me understand better what's going on and also how to use the search path to my advantage. P.S. The `on.exit` saves a lot of warnings for cluster-related connections... I will go ahead and mark it as the accepted answer. – Mihai Aug 07 '19 at 04:45
  • @Mihai no problem. One more piece of advice: if you can put all your generators in your own R package, you can then tell `foreach` to load that package in each worker, and it'd probably save you some problems. You'd have to stop passing the generators directly to it, but you could pass a generator name and use `object <- get(...` or something similar inside the `foreach` call. – Alexis Aug 07 '19 at 05:36
  • This seems like a very elegant solution! Indeed, all the generators live in the same package namespace, and the sole purpose of the factory is to create objects from them. Thanks, I've learned a few very valuable things from you! – Mihai Aug 07 '19 at 06:28