
I'm a bit new to Snakemake.

Imagine that I have a rule like the one below (I've set the number of threads to 10).

Is there any way to make Snakemake magically handle the parallelization of the for loop in this rule?

rule MY_RULE:
    input:
        input_file=TRAIN_DATA
    output:
        output_file=OUTPUT_DATA
    threads: 10
    run:
        for f,o in zip(input.input_file, output.output_file):
            DO_SOMETHING_AND_SAVE(f,o)

Thanks

1 Answer

I guess your rule could be rewritten as follows (with additional code to make a small self-contained example):

import re

TRAIN_DATA = ['a.txt', 'b.txt', 'c.txt']
OUTPUT_DATA = ['a.out', 'b.out', 'c.out']

# Map each output file to the input file it is derived from
files = dict(zip(OUTPUT_DATA, TRAIN_DATA))

# Restrict the wildcard to the known output names so '{x}'
# cannot match arbitrary paths
wildcard_constraints:
    x='|'.join([re.escape(x) for x in OUTPUT_DATA])

rule all:
    input:
        expand('{x}', x=OUTPUT_DATA),

rule MY_RULE:
    input:
        input_file=lambda wc: files[wc.x]
    output:
        output_file='{x}'
    run:
        DO_SOMETHING_AND_SAVE(input.input_file, output.output_file)

This will run rule MY_RULE for each input/output pair in parallel. Of course, the details depend on what exactly you want to do before and after MY_RULE...
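
To actually get the parallel execution, give Snakemake more than one core when you invoke it; it will then schedule the independent MY_RULE jobs concurrently. A minimal invocation sketch (the core count is just an example):

snakemake --cores 10

Adding `-n`/`--dry-run` first prints the per-pair jobs that would be scheduled, without running anything.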

dariober
  • That's correct, but the question was about parallelization. – Dmitry Kuzminov Oct 25 '19 at 19:35
  • Hi, thanks for the answer. I've tried to run exactly your example without success. I got the following exception: ``` Building DAG of jobs... WorkflowError: Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards. ``` To reproduce it I created fake `[a,b,c].txt` files and a function DO_SOMETHING_AND_SAVE. – Tiago Freitas Pereira Oct 28 '19 at 09:33
  • @TiagoFreitasPereira - Can you show how you are executing snakemake? You get that error if you execute it as `snakemake [some options] MY_RULE` instead of just `snakemake [some options]`. As the error says, the final (target) rule cannot contain wildcards. – dariober Oct 28 '19 at 11:17
  • Yes, you got it right. This was exactly the problem I was having. Now it works, and in parallel :-) I have another question: is it straightforward to run this "parallel" rule on an SGE grid? My plan is to run this rule as a job array, e.g. `qsub -t 10 .......` (I'm doing an exploratory analysis on snakemake for my research). Thanks. – Tiago Freitas Pereira Oct 28 '19 at 13:08
  • @TiagoFreitasPereira Yes, it is straightforward to run it on a cluster (it's one of the great things about snakemake), see [cluster-execution](https://snakemake.readthedocs.io/en/stable/executable.html#cluster-execution). The simplest would be `snakemake --cluster "qsub -t 10 ..." -j ` (see the sketch below). I'm not sure about snakemake and job arrays as I haven't used them much, but try googling it or submitting it as another question. – dariober Oct 28 '19 at 13:33
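
A minimal sketch of the cluster invocation from the comments above, assuming an SGE environment; the qsub flags shown are placeholders for whatever your site requires:

# Wrap every job in a qsub submission; keep at most 10 cluster jobs in flight
snakemake --cluster "qsub -V -cwd" --jobs 10

`--cluster` gives the command used to submit each job, and `--jobs` caps how many run concurrently; Snakemake still tracks the dependencies between them.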