SGE array jobs and R

Question

I currently have an R script written to perform a population genetic simulation, then write a table with my results to a text file. I would like to run multiple instances of this script in parallel using an array job (my university's cluster uses SGE), so that when it's all done I will have generated results files corresponding to each job (Results_1.txt, Results_2.txt, etc.).

I spent the better part of the afternoon reading and trying to figure out how to do this, but haven't found anything along the lines of what I am trying to do. Could someone provide an example, or point me toward something I could read to help with this?

Vince · Accepted Answer · 2015-02-24T21:54:27.403

5

To boil down mithrado's answer to the bare essentials:

Create a job script, pop_gen.bash, that passes the SGE task ID to the R script as an argument (optional) and stores the results in a file named after that same task ID:

#!/bin/bash
Rscript pop_gen.R ${SGE_TASK_ID} > Results_${SGE_TASK_ID}.txt

Submit this script as a job array, e.g. 1000 jobs:

qsub -t 1-1000 pop_gen.bash

Grid Engine will execute pop_gen.bash 1000 times, each time setting SGE_TASK_ID to a different value from 1 to 1000.
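
Before submitting, it can help to mimic this behaviour locally. The following is a minimal sketch, not what Grid Engine actually runs: `run_task` is a hypothetical stand-in for the body of pop_gen.bash, the output goes to a temporary directory, and the loop runs tasks one after another, whereas the scheduler may run them concurrently on different nodes.

```shell
#!/bin/bash
# Local dry run: imitate `qsub -t 1-3` by setting SGE_TASK_ID ourselves.
# run_task is a hypothetical placeholder for the real job body
# (e.g. the Rscript call in pop_gen.bash).
outdir=$(mktemp -d)

run_task() {
    echo "simulated result for task ${SGE_TASK_ID}" > "${outdir}/Results_${SGE_TASK_ID}.txt"
}

for SGE_TASK_ID in 1 2 3; do
    export SGE_TASK_ID   # Grid Engine exports this variable for each task
    run_task
done

ls "$outdir"
```

Each pass through the loop sees a different SGE_TASK_ID, so each one writes its own Results file, which is the property the array job relies on.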

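Additionally, as mentioned above, by passing SGE_TASK_ID as a command-line argument to pop_gen.R you can use it inside R to name the output file. If you do this, remove the `> Results_${SGE_TASK_ID}.txt` redirect from the job script (or redirect to a different file name), so that terminal output and the table do not end up in the same file: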

args <- commandArgs(trailingOnly = TRUE)
out.file <- paste("Results_", args[1], ".txt", sep="")
# d <- "some data frame"
write.table(d, file=out.file)
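
If the task ID never reaches R, `args[1]` is NA and the file name becomes Results_NA.txt. One way to catch this early is a guard in the job script. This is a sketch under assumptions: `check_task_id` is a hypothetical helper, and it relies on my understanding that Grid Engine sets SGE_TASK_ID to the string "undefined" for non-array jobs, so check that against your installation.

```shell
#!/bin/bash
# Guard sketch: accept only a non-empty, all-digit task ID.
# check_task_id is a hypothetical helper, not part of Grid Engine.
check_task_id() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, "undefined", or any non-digit text
        *)           return 0 ;;
    esac
}

task_id="${SGE_TASK_ID:-}"
if check_task_id "$task_id"; then
    # In a real job script this line would be the actual Rscript call.
    echo "would run: Rscript pop_gen.R ${task_id}"
else
    echo "SGE_TASK_ID is not a valid task index: '${task_id}'" >&2
fi
```

A misspelled variable name such as `${SGE_TASKID}` expands to an empty string, which this check rejects instead of letting R silently write Results_NA.txt.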

HTH

edited Feb 24 '15 at 21:54

answered Feb 24 '15 at 14:36

Vince


I gave this a try, and it seemed to work up until it was time to write out the table. I did a test using qsub -t 1-3 myscript.bash. Four files were generated: Results_1.txt, Results_2.txt, Results_3.txt, and Results_NA.txt. The first 3 just had some stuff that prints to the terminal from my simulation, while Results_NA.txt appeared to have the right information for one of the runs (I think they all created a Results_NA.txt file and there was some overwriting). Could you explain a little more about passing the SGE_TASK_ID to R? That appears to be where the issue is. – user3381331 Feb 24 '15 at 21:34
Are you redirecting output to the same file on the command line as well as in R? I ask because you say that the Results_1.txt file has text from stdout/stderr. If so, do not redirect to a file within the job script, i.e. remove `> Results_${SGE_TASK_ID}.txt`. Now, for the Results_NA.txt problem, I suspect the argument is not being read into R correctly. In my code, there was an error: I wrote `SGE_TASKID` instead of `SGE_TASK_ID`. Maybe that is the problem? – Vince Feb 24 '15 at 21:52
Okay, that cleared it up. Fixed the typo and used a different name for the table so I could have the terminal output as well as my table. Thanks a lot for the help! – user3381331 Feb 24 '15 at 22:04
I wonder why this was the accepted answer if mine was better explained and more general. – tbrittoborges Feb 25 '15 at 15:42

score 1 · Answer 2 · answered Feb 24 '15 at 12:10

I am not used to doing this in R, but I've been using the same approach in Python. Imagine that you have a script genetic_simulation.r that takes three parameters: --gene_id, --khmer_len and --output_file.

You will have one CSV file, genetic_sim_parms.csv, with n rows, one per run:

first_gene,10,/result/first_gene.txt
...
nth_gene,6,/result/nth_gene.txt
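
Because each row becomes one array task, the row count is the upper bound of the task range. A minimal sketch of computing it follows; `params_file` is a temporary stand-in for genetic_sim_parms.csv, `array_job.bash` is a hypothetical name for the array-job script, and the qsub command is only printed, not executed.

```shell
#!/bin/bash
# Sketch: derive the array range from the row count of the parameter file.
# params_file is a temporary stand-in for genetic_sim_parms.csv;
# array_job.bash is a hypothetical script name.
params_file=$(mktemp)
printf '%s\n' \
    'first_gene,10,/result/first_gene.txt' \
    'nth_gene,6,/result/nth_gene.txt' > "$params_file"

# In awk's END block, NR holds the number of rows read.
n_tasks=$(awk 'END {print NR}' "$params_file")
submit_cmd="qsub -t 1-${n_tasks} array_job.bash"
echo "$submit_cmd"   # printed only; run it yourself on the cluster

rm -f "$params_file"
```

This assumes the file has no header row and no blank lines; either would inflate the count.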

An important detail is the first line of your genetic_simulation.r. It needs to tell the cluster which executable to use. You might need to tweak its parameters as well, depending on your setup; it will look something like:

#!/path/to/Rscript --vanilla

And finally, you will need an array-job bash script:

#!/bin/bash
# Replace N below with the number of rows in genetic_sim_parms.csv
#$ -t 1-N
#$ -N genetic_simulation.r

echo "Starting on : $(date)"
echo "Running on node : $(hostname)"
echo "Current directory : $(pwd)"
echo "Current job ID : $JOB_ID"
echo "Current job name : $JOB_NAME"
echo "Task index number : $SGE_TASK_ID"
# Pull this task's fields from row number SGE_TASK_ID of the CSV
ID=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $1}' genetic_sim_parms.csv)
LEN=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $2}' genetic_sim_parms.csv)
OUTPUT=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $3}' genetic_sim_parms.csv)

echo "id is: $ID"
# Rscript is case-sensitive on Linux; quote the values in case they contain spaces
Rscript genetic_simulation.r --gene_id "$ID" --khmer_len "$LEN" --output_file "$OUTPUT"
echo "Finished on : $(date)"
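
The job script above calls awk three times to read one row. As an alternative sketch, the row can be fetched once and split with `read`. Here `params_file` and `task` are stand-ins (in a real job, `task` would be $SGE_TASK_ID and the file would be genetic_sim_parms.csv), and the approach assumes plain comma-separated fields with no quoted commas.

```shell
#!/bin/bash
# Sketch: read one CSV row into three variables in a single pass.
# params_file and task are stand-ins for genetic_sim_parms.csv and $SGE_TASK_ID.
params_file=$(mktemp)
printf '%s\n' \
    'first_gene,10,/result/first_gene.txt' \
    'nth_gene,6,/result/nth_gene.txt' > "$params_file"
task=2

# sed -n "Np" prints only line N of the file.
row=$(sed -n "${task}p" "$params_file")

# Split the row on commas; IFS is changed for this read command only.
IFS=, read -r ID LEN OUTPUT <<EOF
$row
EOF

echo "id=$ID len=$LEN output=$OUTPUT"
rm -f "$params_file"
```

Reading the row once keeps the three values consistent with each other and avoids scanning the file three times per task.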

Hope this helps!
