Imagine the following snakefile (included at the end of this post), adapted from Handling parallelization, where the rule MY_RULE is I/O bound (let's say I need to load a very heavy model to apply this rule to each file).
With the way wildcards are used in this example, I have as many independent jobs as files, and the --jobs option tells the scheduler how many of them may run at the same time.
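For instance, an invocation like the one below (the 8 is just a placeholder) lets at most 8 instances of MY_RULE run at once, one per output file:

snakemake --jobs 8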
Since MY_RULE is I/O bound, this is not a very efficient way to organize my snakefile.
The question is: is there an elegant way to make my wildcard smart enough to "split" my inputs according to the number of --jobs I have?
Let's say I have 1024 files to be processed and 8 jobs available; is there an elegant way to make my wildcard smart enough that it provides chunks of 128 mutually exclusive files to MY_RULE?
I could implement this explicitly (a rough sketch of what I mean is at the very end of this post), but I just want to know whether this can be handled transparently.
Thank you very much for any input.
import re

workdir: "test"

def DO_SOMETHING_AND_SAVE(input_file, output_file):
    """
    Adding this function here just for the sake of simplicity.
    Here I simulate a job with an overhead.
    """
    print(f"RUNNING MY VERY DEMANDING JOB with {input_file}")
    with open(output_file, 'w') as f:
        f.write("\n")

TRAIN_DATA = ['a.txt', 'b.txt', 'c.txt']
OUTPUT_DATA = ['a.out', 'b.out', 'c.out']

# Map each output file to the input file it is produced from.
files = dict(zip(OUTPUT_DATA, TRAIN_DATA))

# Restrict {x} to the known output file names so MY_RULE only matches them.
wildcard_constraints:
    x='|'.join([re.escape(x) for x in OUTPUT_DATA])

rule all:
    input:
        expand('{x}', x=OUTPUT_DATA),

rule MY_RULE:
    input:
        input_file=lambda wc: files[wc.x]
    output:
        output_file='{x}'
    run:
        DO_SOMETHING_AND_SAVE(input.input_file, output.output_file)
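For completeness, here is a rough sketch of the kind of explicit chunking I mean, just to illustrate the question rather than answer it. The hard-coded N_JOBS, the made-up file names, and the per-chunk sentinel files under chunks/ are all placeholders of mine. Each chunk job would load the heavy model once and then process its 128 files, but the real output files end up hidden behind a sentinel and the chunk size is fixed at parse time, which is exactly the bookkeeping I was hoping Snakemake could handle transparently.

workdir: "test"

N_JOBS = 8  # hard-coded; ideally this would follow --jobs transparently

# Made-up example data: 1024 inputs and their corresponding outputs.
TRAIN_DATA = [f"{i}.txt" for i in range(1024)]
OUTPUT_DATA = [f"{i}.out" for i in range(1024)]
files = dict(zip(OUTPUT_DATA, TRAIN_DATA))

# Split the outputs into N_JOBS mutually exclusive chunks of 128 files each.
CHUNKS = {str(c): OUTPUT_DATA[c::N_JOBS] for c in range(N_JOBS)}

def DO_SOMETHING_AND_SAVE(input_file, output_file):
    print(f"RUNNING MY VERY DEMANDING JOB with {input_file}")
    with open(output_file, 'w') as f:
        f.write("\n")

rule all:
    input:
        expand("chunks/{chunk}.done", chunk=CHUNKS.keys())

rule MY_RULE_CHUNKED:
    input:
        # All input files belonging to this chunk.
        lambda wc: [files[x] for x in CHUNKS[wc.chunk]]
    output:
        # Outputs cannot be functions of wildcards, so a per-chunk sentinel
        # file stands in for the 128 real output files.
        touch("chunks/{chunk}.done")
    run:
        # The heavy model would be loaded once here and then applied to
        # every file in the chunk.
        for out_file in CHUNKS[wildcards.chunk]:
            DO_SOMETHING_AND_SAVE(files[out_file], out_file)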