I'm looking for a way to be able to specify from the command line:

  1. the total number of threads to be used at the same time (even if by multiple jobs)
  2. the maximum number of jobs to run in parallel (which I currently get with --jobs, so all is good here)
  3. for each rule, use the minimum of the rule's threads and the total thread limit (i.e. cap each rule's threads at the global limit).

My rules look like this:

rule a:
    input: "{sample}.in"
    output: "{sample}.out"
    threads: 10
    shell: "some-program --threads {threads}"

rule b:
    input: expand("{sample}.out", sample=SAMPLES)
    output: touch("done.done")
    threads: 1
    shell: "do something"

When I submit my jobs to the cluster with --cluster and a wrapper for qsub, my command line looks like this:

snakemake --cluster "qsub-wrapper --threads {threads}" --jobs N

and hence I specify the number of threads to allocate per job. The --jobs parameter then is interpreted as the number of jobs to submit in parallel to the cluster, but doesn't limit the overall number of threads that will be used.

So for example if I use --jobs 2, then 2 instances of rule a will run in parallel occupying a total of 20 threads.

The solution I found was to use --resources, adding the following to each rule:

resources: nodes=NUMBER_OF_THREADS

NUMBER_OF_THREADS is simply whatever I defined for threads, so the example from above becomes:

rule a:
    input: "{sample}.in"
    output: "{sample}.out"
    threads: 10
    resources: nodes=10
    shell: "some-program --threads {threads}"

rule b:
    input: expand("{sample}.out", sample=SAMPLES)
    output: touch("done.done")
    threads: 1
    resources: nodes=1
    shell: "do something"

And now I run:

snakemake --cluster "qsub-wrapper --threads {threads}" --jobs N --resources nodes=10

Now, even though --jobs would allow 2 jobs to be submitted, only one is submitted at a time because of the resources limit.

Is there a better way to do this?

Also, is there a way to access the resources variable from within the Snakefile? The reason I want this is that I now face a different problem: if the resources specified on the command line are lower than the threads of a rule, that rule is never submitted to the queue. So what I would like to do is something like this:

rule a:
    input: "{sample}.in"
    output: "{sample}.out"
    threads: min(10, command_line_specified_resources.nodes)
    resources: nodes=min(10, command_line_specified_resources.nodes)
    shell: "some-program --threads {threads}"

But I haven't found a way to access the command line specified resources (I tried seeing if the workflow object would have that, but I didn't see anything).
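
In the meantime, the closest workaround I can think of is to pass the cap a second time through --config, since the config dictionary is definitely available inside the Snakefile. This is only a rough sketch (max_nodes is a config key I made up here, and it duplicates the value already given to --resources, which is exactly the repetition I'd like to avoid):

snakemake --cluster "qsub-wrapper --threads {threads}" --jobs N --resources nodes=10 --config max_nodes=10

and in the Snakefile:

# max_nodes is a made-up config key that just mirrors --resources nodes=10
MAX_NODES = int(config.get("max_nodes", 10))

rule a:
    input: "{sample}.in"
    output: "{sample}.out"
    threads: min(10, MAX_NODES)
    resources: nodes=min(10, MAX_NODES)
    shell: "some-program --threads {threads}"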

Thank you for your help!

  • I haven't played with resources so I can't answer your question, but I understand the --jobs flag differently than you describe it. I think it solves your #1 (total threads), not #2. – Marmaduke Oct 18 '18 at 23:59
  • Thanks for your comment @Marmaduke. While I agree with you that this is the most reasonable interpretation of `--jobs`, and the help menu clearly states it: """ --cores [N], --jobs [N], -j [N] Use at most N cores in parallel (default: 1). If N is omitted, the limit is set to the number of available cores. """ practically speaking, I have been using snakemake a lot, and with qsub, I found `--jobs` to work the way I described it above. – Alon Shaiber Oct 19 '18 at 16:45
