Imagine the following snakefile (included at the end of this post), adapted from Handling parallelization, where the rule MY_RULE is I/O bound (let's say I need to load a very heavy model to apply this rule to each file).
With the way wildcards are used in this example, I have as many independent jobs as files, and the --jobs option tells the scheduler how many of them may run at the same time.
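For instance, an invocation like the one below (the 8 is just a placeholder) lets at most 8 instances of MY_RULE run at once, one per output file:

snakemake --jobs 8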
Since MY_RULE is I/O bound, this is not a very efficient way to organize my snakefile.
The question is: is there an elegant way to make my wildcard smart enough to "split" my inputs according to the number of --jobs I have?
Let's say I have 1024 files to be processed and 8 jobs available; is there an elegant way to make my wildcard smart enough that it provides chunks of 128 mutually exclusive files to MY_RULE?
I could implement this explicitly (a rough sketch of what I mean is at the very end of this post), but I just want to know whether this can be handled transparently.
Thank you very much for any input.
import re

workdir: "test"

def DO_SOMETHING_AND_SAVE(input_file, output_file):
    """
    Adding this function here just for the sake of simplicity.
    Here I simulate a job with an overhead.
    """
    print(f"RUNNING MY VERY DEMANDING JOB with {input_file}")
    with open(output_file, 'w') as f:
        f.write("\n")

TRAIN_DATA = ['a.txt', 'b.txt', 'c.txt']
OUTPUT_DATA = ['a.out', 'b.out', 'c.out']

# Map each output file to the input file it is produced from.
files = dict(zip(OUTPUT_DATA, TRAIN_DATA))

# Restrict {x} to the known output file names so MY_RULE only matches them.
wildcard_constraints:
    x='|'.join([re.escape(x) for x in OUTPUT_DATA])

rule all:
    input:
        expand('{x}', x=OUTPUT_DATA),

rule MY_RULE:
    input:
        input_file=lambda wc: files[wc.x]
    output:
        output_file='{x}'
    run:
        DO_SOMETHING_AND_SAVE(input.input_file, output.output_file)
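For completeness, here is a rough sketch of the kind of explicit chunking I mean, just to illustrate the question rather than answer it. The hard-coded N_JOBS, the made-up file names, and the per-chunk sentinel files under chunks/ are all placeholders of mine. Each chunk job would load the heavy model once and then process its 128 files, but the real output files end up hidden behind a sentinel and the chunk size is fixed at parse time, which is exactly the bookkeeping I was hoping Snakemake could handle transparently.

workdir: "test"

N_JOBS = 8  # hard-coded; ideally this would follow --jobs transparently

# Made-up example data: 1024 inputs and their corresponding outputs.
TRAIN_DATA = [f"{i}.txt" for i in range(1024)]
OUTPUT_DATA = [f"{i}.out" for i in range(1024)]
files = dict(zip(OUTPUT_DATA, TRAIN_DATA))

# Split the outputs into N_JOBS mutually exclusive chunks of 128 files each.
CHUNKS = {str(c): OUTPUT_DATA[c::N_JOBS] for c in range(N_JOBS)}

def DO_SOMETHING_AND_SAVE(input_file, output_file):
    print(f"RUNNING MY VERY DEMANDING JOB with {input_file}")
    with open(output_file, 'w') as f:
        f.write("\n")

rule all:
    input:
        expand("chunks/{chunk}.done", chunk=CHUNKS.keys())

rule MY_RULE_CHUNKED:
    input:
        # All input files belonging to this chunk.
        lambda wc: [files[x] for x in CHUNKS[wc.chunk]]
    output:
        # Outputs cannot be functions of wildcards, so a per-chunk sentinel
        # file stands in for the 128 real output files.
        touch("chunks/{chunk}.done")
    run:
        # The heavy model would be loaded once here and then applied to
        # every file in the chunk.
        for out_file in CHUNKS[wildcards.chunk]:
            DO_SOMETHING_AND_SAVE(files[out_file], out_file)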