I am working with a large data set on which I run certain calculations. Because the data set is huge, the calculations take excessively long on my machine, so I decided to use the future package to distribute the work across several machines and speed things up. My problem is that through future (using PuTTY and SSH) I can connect to those machines in parallel, but all the work is still done by the main machine, without any distribution. Maybe you can advise on a solution:

  • How can I make it work on all machines?
  • Also, how can I check whether the process is running on them (I mean some function or anything else that could help verify that the workers are functioning, if such a thing exists)?

My code:

library(future)
workers <- c("000.000.0.000", "111.111.1.111")
plan(remote, envir = parent.frame(), workers= workers, myip = "222.222.2.22")
start <- proc.time()
cl <- makeClusterPSOCK(
  c("000.000.0.000", "111.111.1.111"), user = "...",
  rshcmd = c("plink", "-ssh", "-pw", "..."),
  rshopts = c("-i", "V:\\vbulavina\\privatekey.ppk"),
  homogeneous = FALSE)
setwd("V:/vbulavina/r/inversion")
a <- source("fun.r")
f <- future({source("pasos.r")})
l <- future({source("pasos2.R")})
time_elapsed_parallel <- proc.time() - start
time_elapsed_parallel

The f and l futures are supposed to be evaluated in parallel, but the master machine does all the work, so I'm a bit confused about whether I can do something about it.

PS: I tried plan() with remote, multiprocess, multisession, and cluster, and none of them made a difference.

PS2: My local machine runs Windows, and I am trying to connect to Kubuntu and Debian machines (the firewall is off on all of them).

Thanks in advance.

Axeman
  • @Axeman The thing is that the plan code does nothing for me, because without it there is still a connection, but no distribution between the machines – zoidberg724 Jul 25 '18 at 07:30
  • @Axeman So yes, I tried `plan(remote, envir = parent.frame(), workers= workers, myip = "192.168.2.48")` and got the error `Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, : reached elapsed time limit` – zoidberg724 Jul 25 '18 at 07:47
  • @Axeman oh sorry, you're right! – zoidberg724 Jul 25 '18 at 07:50
  • @Axeman So `workers` is just the IPs of the machines I am supposed to use for the connection – zoidberg724 Jul 25 '18 at 07:56

1 Answer

Author of future here. First, make sure you can set up the PSOCK cluster, i.e. connect to the two workers over SSH and run Rscript on them. You can do this with:

library(future)
workers <- c("000.000.0.000", "111.111.1.111")
cl <- makeClusterPSOCK(workers, user = "...",
                       rshcmd = c("plink", "-ssh", "-pw",  "..."),
                       rshopts = c("-i", "V:/vbulavina/privatekey.ppk"),
                       homogeneous = FALSE)
print(cl)
### socket cluster with 2 nodes on hosts '000.000.0.000', '111.111.1.111'

(If the above makeClusterPSOCK() stalls or doesn't work, add argument verbose = TRUE to get more info - feel free to report back here.)

Next, with the PSOCK cluster set up, tell the future system to parallelize over those two workers:

plan(cluster, workers = cl)

Test that futures are actually resolved remotely, e.g.

f <- future(Sys.info()[["nodename"]])
print(value(f))
### [1] "000.000.0.000"
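
To check more than a single future, one future per worker can report back where it ran. The following is only a minimal sketch, assuming two local background R sessions stand in for the remote machines so that it runs anywhere; with the SSH cluster above, plan(cluster, workers = cl) would be used instead. resolved() polls a future without blocking, whereas value() blocks until the result is available.

```r
library(future)

# Assumption: two local background R sessions stand in for the remote
# workers. With the SSH cluster above, use plan(cluster, workers = cl).
plan(multisession, workers = 2)

# Launch two futures; each reports the host name and process ID of the
# R process that evaluated it.
fs <- lapply(1:2, function(i) {
  future(list(host = Sys.info()[["nodename"]], pid = Sys.getpid()))
})

# resolved() checks, without blocking, whether each future has finished.
print(vapply(fs, resolved, logical(1)))

# value() blocks until the future is resolved and returns its result.
info <- lapply(fs, value)
pids <- vapply(info, function(x) x$pid, integer(1))
print(pids)

# A worker PID that differs from the main session's PID shows that the
# code ran in another R process.
print(Sys.getpid())

plan(sequential)  # shut down the background workers
```

On real remote workers, the host element should show the name of the remote machine rather than that of the master.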

I leave the remaining part, which also needs adjustments, for now - let's make sure to get the workers up and running first.

Continuing, using source() in parallel processing complicates things, especially when the parallelization is done on different machines. For instance, calling source("my_file.R") on another machine requires that the file my_file.R is available on that machine too. Even if it is, it also complicates things when it comes to the automatic identification of variables that need to be exported to the external machine. A safer approach is to incorporate all the code in the main script. Having said all this, you can try to replace:
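
The automatic identification of variables mentioned above can be illustrated with code placed directly in the future expression. This is a minimal sketch, assuming a local multisession plan in place of the remote cluster; scale_by() and x are hypothetical stand-ins for whatever fun.r and the pasos scripts define.

```r
library(future)

# Assumption: local background sessions instead of the remote machines.
plan(multisession, workers = 2)

# Hypothetical stand-ins for objects that fun.r might define.
scale_by <- function(v, k) v * k
x <- 1:5

# The expression is inspected when the future is created; 'scale_by' and
# 'x' are identified as globals and exported to the worker automatically.
f <- future({
  scale_by(x, k = 2)
})

# value() blocks until the worker has finished and returns the result.
res <- value(f)
print(res)

plan(sequential)  # shut down the background workers
```

Because the code is part of the expression itself, no file needs to exist on the worker.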

f <- future({source("pasos.r")})
l <- future({source("pasos2.R")})

with

futureSource <- function(file, envir = parent.frame(), ...) {
  expr <- parse(file)
  future(expr, substitute = FALSE, envir = envir, ...)
}

f <- futureSource("pasos.r")
l <- futureSource("pasos2.R")

As long as pasos.r and pasos2.R don't call source() internally, this could/should work.

BTW, what version of Windows are you on? Because with an up-to-date Windows 10, you have built-in support for SSH and you no longer need to use PuTTY.

UPDATE 2018-07-31: Continued the answer regarding the use of source() in futures.

HenrikB
  • Thanks a lot. As you said, yesterday I got the connection between the machines working with `verbose = TRUE` and by obtaining private keys from the Ubuntu machines and sharing them with the Windows one (btw, it's Windows 7). The error that occurred afterwards was about changing `plan(multiprocess)`: I tried to change the plan, but it gave me an error that I can't use it with an integer, if I'm not mistaken; I can't say for sure, as I don't have my computer in front of me. So that's the thing: the connection is 100% there, but the code inside the `future` is not getting executed. – zoidberg724 Jul 27 '18 at 06:50
  • (1) You should not need to share private SSH keys across machines - only public ones. (2) The `plan(cluster, workers = cl)` example I give above is how you set up the two workers; that's the only `plan()` you should need. If you want to run, say, four workers on each of those two machines, use `workers <- rep(c("000.000.0.000", "111.111.1.111"), each = 4L)`. (3) Yes, I don't expect your `future(source(...))` calls to work, but let's talk about that when you've confirmed that it works for you with the `plan()` I suggest. – HenrikB Jul 27 '18 at 06:58
  • Okay, thanks. With the public one it didn't work. I tried several times and obtained the connection with the private one. – zoidberg724 Jul 27 '18 at 07:36
  • Hmm..kay. Please confirm that `f <- future(Sys.info()[["nodename"]])` and `value(f)` work as in my example. – HenrikB Jul 27 '18 at 09:09
  • Sorry for the delay, some problems came up. So, `f <- future(Sys.info()[["nodename"]])` and `value(f)` worked; the result I got was `"cluster01"`. – zoidberg724 Jul 31 '18 at 06:18
  • The only thing is that after I ran `plan(cluster, workers = cl)`, it gave me the warning `closing unused connection 6 (<-BLAHBLAH)`; after that I ran `value(f)` and it gave me the name of the cluster – zoidberg724 Jul 31 '18 at 06:37
  • The warning "closing unused connection 6 (<-BLAHBLAH)" is produced when the R garbage collector cleans up a no longer used, previously created, cluster object. The garbage collector probably ran when you recreated `cl` making the old one available for cleanup. – HenrikB Jul 31 '18 at 10:27
  • I've updated my answer to address the remaining part of your script where you use `future(source())`. – HenrikB Jul 31 '18 at 10:38
  • I tried your updated variant and it's not working either. Maybe I could copy-paste the code I have in `pasos.R/2` into the future function and that would work? Or would it be better to send the files to the other machines?! – zoidberg724 Jul 31 '18 at 11:37
  • The option of putting the code inside the package didn't work. – zoidberg724 Jul 31 '18 at 11:42
  • Start out making sure it works in the current R session without parallelization by using `plan(sequential)`. Then move on to parallelize on your local computer using `plan(multiprocess)`. That should give you something to work with. Only then, worry about getting it to work on external machines. PS. Saying just "not working" and "didn't work" is not very helpful when asking for help - that is extremely little information to work with. – HenrikB Jul 31 '18 at 19:48
  • By just "not working" and "didn't work" I meant that after executing the code nothing happened: not a single process, none of the results that should have appeared, only a new command line. – zoidberg724 Aug 01 '18 at 12:42
  • Ok... at least I think the future calls return, i.e. variables `f` and `l` are created. To get the result of futures, you also need to call `value(f)` and `value(l)`. Those will block until the futures are resolved. – HenrikB Aug 01 '18 at 14:12
  • I understood why it "didn't work": future only obtained a connection between the machines under R version 3.4.2 (I have no idea why, most likely because of the system I work on), but the code I need to launch uses packages built under version 3.4.4. That mismatch caused the "not working" and "didn't work" status, as well as the appearance of a new command line. But this is no longer a problem with the package, so thank you very much for your attention! Wish you all the best and really appreciate your collaboration!! – zoidberg724 Aug 02 '18 at 12:44