
I've got access to a big, powerful cluster. I'm a halfway decent R programmer, but totally new to shell commands (and terminal commands in general besides basic things that one needs to do to use ubuntu).

I want to use this cluster to run a bunch of parallel processes in R, and then I want to combine them. Specifically, I have a problem analogous to:

my.function <- function(data, otherdata, N) {
    mod <- lm(y ~ x, data = data)
    a <- predict(mod, newdata = otherdata, se.fit = TRUE)
    b <- rnorm(N, a$fit, a$se.fit)
    b
}

r1 = my.function(data, otherdata, N)
r2 = my.function(data, otherdata, N)
r3 = my.function(data, otherdata, N)
r4 = my.function(data, otherdata, N)
...
r1000 = my.function(data, otherdata, N)

results = list(r1,r2,r3,r4, ... r1000)

The above is just a dumb example, but basically I want to do something 1000 times in parallel, and then do something with all of the results from the 1000 processes.

How do I submit 1000 jobs simultaneously to the cluster, and then combine all the results, like in the last line of the code?

Any recommendations for well-written manuals/references for me to go RTFM with would be welcome as well. Unfortunately, the documents that I've found aren't particularly intelligible.

Thanks in advance!

generic_user
  • Check the `parallel` package. – krlmlr Jan 27 '13 at 23:04
  • If you are using a supercomputer, it's likely using [OpenMPI](http://www.open-mpi.org/) or something similar. If this is the case, you have to use something like [snow](http://cran.r-project.org/web/packages/snow/index.html) directly (though the new parallel library may support MPI). I have instructions on how to set up R for use on the Ohio Supercomputer [here](http://leftcensored.skepsi.net/2011/03/28/using-r-and-snow-on-ohio-supercomputer-centers-glenn-cluster/) and a simple intro to snow [here](http://leftcensored.skepsi.net/2011/04/02/a-very-short-and-unoriginal-introduction-to-snow/). – Jason Morgan Jan 28 '13 at 02:24
  • @ACD It would help us answer your question if you could provide more information about what type of system you are using and what software it's running. – Jason Morgan Jan 29 '13 at 16:13

3 Answers


You can combine the plyr package with doMC (a parallel backend for the foreach package) as follows:

require(plyr)
require(doMC)
registerDoMC(20) # for 20 processors

results <- llply(1:1000, function(idx) {
    my.function(data, otherdata, N)
}, .parallel = TRUE)
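Once llply returns, you have a list of 1000 result vectors that you can combine with base R. A small self-contained sketch (using plain rnorm() as a stand-in for my.function, and assuming each replicate returns a numeric vector of the same length):

```r
# Stand-in for the llply output: one length-5 numeric vector per replicate
results <- lapply(1:1000, function(idx) rnorm(5))

# Stack all draws into a single long vector ...
all.draws <- do.call(c, results)

# ... or into a 1000 x 5 matrix, one row per replicate
draw.matrix <- do.call(rbind, results)
```

From there you can summarize across replicates, e.g. `colMeans(draw.matrix)`.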

Edit: If you're talking about submitting simultaneous jobs, doesn't your cluster run LSF? You can then use bsub to submit as many jobs as you need, and it also takes care of load balancing and whatnot...!

Edit 2: A small note on load-balancing (example using LSF's bsub):

What you mention is similar to what I wrote about here => LSF. You can submit jobs in batches. For example, under LSF you can use bsub to submit a job to the cluster like so:

bsub -m <nodes> -q <queue> -n <processors> -o <output.log> \
     -e <error.log> Rscript myscript.R

This places your job in the queue and allocates the requested number of processors; your job starts running if and when the resources become available. You can pause, restart, and suspend your jobs, and much more. qsub (from PBS/Torque) is similar in concept. The learning curve may be a bit steep, but it is worth it.
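For the 1000-job case specifically, LSF job arrays let you submit everything in one bsub call; each array element gets its index in the LSB_JOBINDEX environment variable, which your R script can read to pick its task. A sketch (the job name, queue, and myscript.R are placeholders):

```shell
# Submit 1000 array elements with one command; %I in the log file
# names expands to each element's index (1..1000).
bsub -J "myjob[1-1000]" -q <queue> -o out.%I.log -e err.%I.log \
     Rscript myscript.R
```

Inside myscript.R you would then do something like `idx <- as.integer(Sys.getenv("LSB_JOBINDEX"))` and save that replicate's result to a file, to be combined in a final job once all elements finish.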

Arun
  • 116,683
  • 26
  • 284
  • 387
  • This worked, thanks! As for LSF and bsub, I don't know what you're talking about and that wikipedia link is incomprehensible to me, as of now. Still a huge noob! But less of one than I was a couple of days ago. – generic_user Jan 29 '13 at 07:50
  • Well, actually, I upvoted your answer because it was an immediate help. But it turns out there is more to it. Apparently I need to figure out how to make my R scripts into "jobs" using something called "PBS" and some command called "qsub." doMC is not really getting me there, and it almost seems like the computer admins have throttled me? It was faster yesterday. – generic_user Jan 29 '13 at 15:45
  • @ACD Also look at the links I provide in my comment to your question. I have instructions on how to set up a PBS file, etc., for a supercomputer running MPI. – Jason Morgan Jan 29 '13 at 16:11

We wrote a survey paper on the State of the Art in Parallel Computing with R in the Journal of Statistical Software (which is an open journal). You may find this useful as an introduction.

Dirk Eddelbuettel

The Message Passing Interface (MPI) does what you want, and it is easy to use. After compiling, you run:

mpirun -np [no.of.process] [executable]

You select where it runs with a simple text file listing hostnames and IP addresses, like:

node01   192.168.0.1
node02   192.168.0.2
node03   192.168.0.3

Here are more examples of MPI.
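To drive this from R, the parallel package (the successor to the snow package mentioned in the comments) exposes the same apply-style API over a cluster of workers; an MPI cluster needs the Rmpi package and a launch via mpirun, but the identical parLapply call can be tested locally on a socket cluster first. A self-contained sketch (sim.one is a hypothetical stand-in for my.function):

```r
library(parallel)

# On an MPI cluster (requires the Rmpi package, launched via mpirun):
#   cl <- makeCluster(20, type = "MPI")
# For local testing, the identical API works on a socket cluster:
cl <- makeCluster(2)

# Stand-in for my.function(): draw 5 values per replicate
sim.one <- function(idx) rnorm(5)

# Run the 1000 replicates across the workers; returns a list of results
results <- parLapply(cl, 1:1000, sim.one)

stopCluster(cl)
```

The list returned by parLapply plays the role of `results = list(r1, ..., r1000)` in the question.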