I've set up a Snakemake pipeline for doing some simple QC and analysis on shallow shotgun metagenomics samples coming through our lab.
Some of the tools in the pipeline fail or error out when they receive inputs with too few reads -- but this can't always be predicted from the raw input data, because intermediate filtering steps (such as adapter trimming and host genome removal) remove varying numbers of reads.
Ideally, I would like to handle these cases with some sort of check on the inputs to certain rules, which could evaluate the number of reads in an input file and decide whether or not to continue with that portion of the workflow graph. Has anyone implemented something like this successfully?
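I've been reading about checkpoints, and something along the lines of the sketch below is roughly what I'm picturing. All the rule names, file paths, the threshold, and the profiler command are placeholders rather than pieces of my actual pipeline, so please treat it as an illustration of the idea and not a working solution:

# Placeholder threshold -- not a value from my real pipeline.
MIN_READS = 100_000

# "filtered/{sample}.fastq.gz" stands in for the output of the existing
# trimming / host-removal steps.
checkpoint count_reads:
    input:
        "filtered/{sample}.fastq.gz"
    output:
        "counts/{sample}.txt"
    shell:
        "echo $(( $(zcat {input} | wc -l) / 4 )) > {output}"

def downstream_input(wildcards):
    # Re-evaluated by Snakemake once the checkpoint above has finished,
    # so the count file is guaranteed to exist at this point.
    counts = checkpoints.count_reads.get(sample=wildcards.sample).output[0]
    with open(counts) as fh:
        n_reads = int(fh.read().strip())
    if n_reads >= MIN_READS:
        return f"profiles/{wildcards.sample}.tsv"
    # Too few reads: record a flag instead of running the analysis.
    return f"flags/{wildcards.sample}.low_reads"

rule profile:
    input:
        "filtered/{sample}.fastq.gz"
    output:
        "profiles/{sample}.tsv"
    shell:
        "run_profiler {input} > {output}"  # placeholder command

rule low_reads_flag:
    output:
        touch("flags/{sample}.low_reads")

rule per_sample_done:
    input:
        downstream_input
    output:
        touch("done/{sample}.ok")

Is the checkpoint + input-function pattern the recommended way to do this, or is there a cleaner approach I'm missing?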
Many thanks, -jon