
Inexperienced, self-taught "coder" here, so please be understanding :]

I am trying to learn and use Snakemake to construct a pipeline for my analysis. Unfortunately, I am unable to run multiple instances of a single job/rule at the same time. My workstation is not a computing cluster, so I cannot use that option. I looked for an answer for hours, but either there is none, or I am not knowledgeable enough to understand it. So: is there a way to run multiple instances of a single job/rule simultaneously?

If you would like a concrete example:

Let's say I want to analyze a set of 4 .fastq files using the fastqc tool. So I input the command:

time snakemake -j 32

and thus run my code, which is:

SAMPLES, = glob_wildcards("{x}.fastq.gz")

rule Raw_Fastqc:
    input:
            expand("{x}.fastq.gz", x=SAMPLES)
    output:
            expand("./{x}_fastqc.zip", x=SAMPLES),
            expand("./{x}_fastqc.html", x=SAMPLES)
    shell:
            "fastqc {input}"

I would expect snakemake to run as many instances of fastqc as possible on 32 threads (so easily all 4 of my input files at once). In reality, this command takes about 12 minutes to finish. Meanwhile, utilizing GNU parallel from inside snakemake

shell:
    "parallel fastqc ::: {input}"

I get results in 3 minutes. Clearly there is some untapped potential here.

Thanks!

AdrianS85
    Similar issue here: https://stackoverflow.com/q/50828233/1878788. This seems a common pitfall. – bli Jun 26 '18 at 17:58
  • Yes, I saw that topic, but I incorrectly thought my problem was different because I do not use a computing cluster; hence the duplicate question. Cheers! – AdrianS85 Jun 27 '18 at 11:41

2 Answers


If I am not wrong, fastqc works on each fastq file separately, and therefore your implementation doesn't take advantage of Snakemake's parallelization. You can enable it by defining the targets in a rule all, as shown below.

SAMPLES, = glob_wildcards("{x}.fastq.gz")

rule all:
    input:
        expand("./{sample_name}_fastqc.{ext}",
               sample_name=SAMPLES, ext=['zip', 'html'])

rule Raw_Fastqc:
    input:
        "{x}.fastq.gz"
    output:
        "./{x}_fastqc.zip",
        "./{x}_fastqc.html"
    shell:
        "fastqc {input}"
Manavalan Gajapathy
  • Thank you! You are absolutely correct! Although I did not use the exact code you provided, the parallel processing did work when I included the "all" rule and removed the "expand"s from the subsequent rule. The only problem now is that I do not understand why snakemake acts like this. Perhaps you could point me to an appropriate tutorial/manual (if it is explained at http://snakemake.readthedocs.io, I cannot see or understand it...)? – AdrianS85 Jun 27 '18 at 11:32
  • @AdrianS85 Snakemake will parallelize by running several instances of a rule (i.e. "jobs") in parallel (with different values of the wildcards, as determined from its output). When you use `expand`, you "consume/resolve" wildcards. A rule whose output is an `expand` over an `x` wildcard says: "One job will output multiple files", not "One job will output the file corresponding to a given value of `x`". The typical way to do is to have an `all` "driving rule", that has an `expand` in its input, just to say "In the end, I want all those files", and a "driven rule", with a wildcard in its output. – bli Jun 27 '18 at 13:00
  • @AdrianS85 I tried to develop these explanations here: https://bitbucket.org/snakemake/snakemake/pull-requests/307/helping-understand-expand-and/diff Hopefully this will end up integrated in the official documentation and improved. – bli Jun 28 '18 at 09:42
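The driving/driven pattern bli describes can be sketched as a minimal Snakefile (illustrative only; the rule name fastqc_one is made up here, and it assumes the .fastq.gz files sit in the working directory):

    SAMPLES, = glob_wildcards("{x}.fastq.gz")

    # "Driving" rule: an expand in its *input* just lists every file
    # we want to exist in the end; it consumes the wildcard.
    rule all:
        input:
            expand("{x}_fastqc.zip", x=SAMPLES)

    # "Driven" rule: a bare wildcard in its output, so each value of x
    # becomes a separate job that Snakemake can schedule in parallel.
    rule fastqc_one:
        input:
            "{x}.fastq.gz"
        output:
            "{x}_fastqc.zip",
            "{x}_fastqc.html"
        shell:
            "fastqc {input}"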

To add to JeeYem's answer above, you can also define the number of threads to reserve for each job using the 'threads' property of each rule, like so:

rule Raw_Fastqc:
    input:
        "{x}.fastq.gz"
    output:
        "./{x}_fastqc.zip",
        "./{x}_fastqc.html"
    threads: 4
    shell:
        "fastqc --threads {threads} {input}"

Because fastqc itself can use multiple threads per task, you might even get additional speedups over the parallel implementation.

Snakemake will then automatically run as many jobs in parallel as fit within the total number of cores given by the top-level -j flag:

snakemake -j 32, for example, would execute up to 8 instances of the Raw_Fastqc rule.
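One way to check this scheduling before a real run (a sketch, assuming a reasonably recent Snakemake) is a dry run, which lists every job that would be executed without actually running anything:

    snakemake -n -p -j 32

Here -n (--dry-run) prints each planned Raw_Fastqc job plus a job-count summary, and -p also shows the exact shell command that would be run for each job.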

Jon
  • Thanks for the hint! Unfortunately, I don't think it works for `fastqc`. After the snakemake parallelization worked, I tried adding `--threads 4` to `shell: "fastqc"`, but it did not speed up the execution. – AdrianS85 Jun 27 '18 at 11:37