0

In rule Script, I have a few samples that fail out, but the majority pass. I would like Snakemake to see these failures and continue with the downstream rule rule build_script_table. I am not really sure how to do this. Any help on this would be much appreciated. Currently I have a crude .py script that handles this, but want to automate this if possible.

rule script:
    input: input_files
    output:
        'script_out/{sampleID}/{sampleID}.out.tsv'
    threads: 8
    params:
        Toys = config['Toys_dir'],
        db = config['Toys_db'],
    run:
        shell('export PATH={params.Toys}/samtools-0.1.19:$PATH; \
                rm -r script_out/{wildcards.sampleID}; \
                {params.Toys}/Toys.pl \
                -name {wildcards.sampleID} \
                -o script_out/{wildcards.sampleID} \
                -db {params.db} \
                -p {threads} \
                {input}')

rule script_copy:
    input: rules.script.output
    output: 'script_calls/{sampleID}_out_filtered.tsv'
    run:
        shell('cp {input} {output}')

rule build_script_table:
    input: expand('script_calls/{sampleID}_out_filtered.tsv', sampleID=sampleIDs)
    output: 'tables/all_script.txt'
    params:
        span = config['length'],
    run:
        dfs = []
        for fname in input:
            df = pandas.read_csv(fname, sep='\t')
            if len(df) > 0:
                df['sampleID'] = fname.split('/')[-1].split('_')[0]
                df['Toyscript'] = 1
                df['Match'] = df.apply(lambda row: sorted_Match(row['ToyName1'], row['ToyName2']), axis=1)
                df['supporting_prices'] = df.spanningdates
                df['total_price'] = df['supporting_prices'].groupby(df['Match']).transform('sum') # combine fusions that are A|B and B|A
                df.drop_duplicates('Match', inplace=True) # only keep the first row of each fusion now that support reads are summed
                df = df[df['total_price'] >= params.length] # remove fusions with too few supporting reads
                scores = list(range(1, len(df) + 1))
                scores.reverse() # you want the fusions with the most reads getting the highest score
                df.sort_values(by=['total_price'], ascending=False, inplace=True)
                df['script_rank'] = scores
                df['script_score'] = df['script_rank'].apply(lambda x: float(x)/len(df)) # percent scores for each fusion with 1 being top fusion
                dfs.append(df)

        dfsc = pandas.concat(dfs)
        dfsc.to_csv(output[0], sep='\t', index=False)
Genetics
  • 279
  • 2
  • 11
  • Duplicate of https://stackoverflow.com/questions/59039623/snakemake-how-to-execute-downstream-rules-when-an-upstream-rule-fails ? – dariober Mar 14 '22 at 17:57
  • @ dariober, it actually isnt a duplicate. As I am running through two rules not 1 additional one. Do you have a suggestion you could write out? – Genetics Mar 14 '22 at 18:09
  • 1
    Maybe try to give a runnable self-contained example. The idea is that a rule that is allowed to fail writes a dummy empty file (or a file with some meaningful message) in case of failure. Downstream rules behave according to whether their input is a regular file or it is a dummy file. – dariober Mar 14 '22 at 19:24
  • 1
    By the way, do you really need to run samtools 0.1.19? It is very old now... – dariober Mar 14 '22 at 19:26
  • @ dariober, for this PJ yes. I hear what you are saying. I am just having trouble envisiong this. Thus, having a difficult time writing this. – Genetics Mar 14 '22 at 19:28

2 Answers2

0

Perhaps not any more elegant, but I think you could split your workflow in half and invoke snakemake twice.

# partial.smk
rule may_fail:
    output: {sample}.out
# rest.smk
def build_table_input(wildcards):
    return expand('{sample}.out', sample=glob_wildcards('{sample}.out').sample)

rule build_table:
    input: build_table_input
    output: output.txt

To run, you execute the partial snakefile with the --keep-going flag and once that is done you execute the rest:

snakemake -s partial.smk --keep-going ; snakemake -s rest.smk

I think the right way to go about it is to catch the "fail out" happening in the script rule and instead of throwing an exit 1 you create an empty file that you handle separately in the build_table rule.

Troy Comi
  • 1,579
  • 3
  • 12
0

See if this helps. As I say in comments, write a dummy file in case of acceptable failure and use it downstrean to decide what to do:

rule script_can_fail:
    input: input_files
    output:
        'script_out/{sampleID}/{sampleID}.out.tsv'
    run:
        cmd = "samtools ..."
        p = subprocess.Popen(cmd, shell= True, stdout= subprocess.PIPE, stderr= subprocess.PIPE)
        stdout, stderr= p.communicate()

        if p.returncode != 0:
            # Analyze exit code and stderr. If the error can be ignored write dummy a file
            if 'SOME KNOWN ERROR' in stderr.decode():
                with open(output) as fout:
                    fout.write('FAILED')
            else:
                # This is an error that should not be ignored
                raise Exception(...)
             

rule script_copy:
    input: rules.script.output
    output: 'script_calls/{sampleID}_out_filtered.tsv'
    run:
        exit_status = open(input).readline()[0].strip()
        if exit_status == 'FAILED':
            # Write another dummy file signalling acceptable error upstream
            shell('echo "FAILED" > {output}')
        else:
            shell("do something else...")
            

rule build_script_table:
    input: expand('script_calls/{sampleID}_out_filtered.tsv', sampleID=sampleIDs)
    output: 'tables/all_script.txt'
    run:
        for fname in input:
            exit_status = open(fname).readline()[0].strip()
            if exit_status == 'FAILED':
                # Skip this file or do something meanigful with this fname
            else:
                df = pandas.read_csv(fname, sep='\t')
                # etc...
dariober
  • 8,240
  • 3
  • 30
  • 47
  • @ dariober, I assume in your rule, rule script_can_fail, where cmd == ... I am sticking all of my previous shell from my first rule? – Genetics Mar 15 '22 at 14:50
  • @Genetics, yes `cmd` is the string of commands you want to execute on the shell. In your case `cmd` is like `export PATH=...`. I use `subprocess` here instead of snakemake's `shell` function to have more control to decide what to do when the exit code is non-zero. – dariober Mar 15 '22 at 15:05