In rule script, a few samples fail but the majority pass. I would like Snakemake to register these failures and still continue with the downstream rule build_script_table, but I am not sure how to do this; the kind of fallback I have in mind is sketched just after rule script below. Any help would be much appreciated. At the moment I handle this with a crude .py script outside the workflow, but I would like to automate it if possible.
rule script:
    input: input_files
    output:
        'script_out/{sampleID}/{sampleID}.out.tsv'
    threads: 8
    params:
        Toys = config['Toys_dir'],
        db = config['Toys_db'],
    run:
        shell('export PATH={params.Toys}/samtools-0.1.19:$PATH; '
              # -rf so a missing directory on a fresh run does not abort the command
              'rm -rf script_out/{wildcards.sampleID}; '
              '{params.Toys}/Toys.pl '
              '-name {wildcards.sampleID} '
              '-o script_out/{wildcards.sampleID} '
              '-db {params.db} '
              '-p {threads} '
              '{input}')
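To illustrate what I'm after, a variant like the sketch below is the kind of thing I imagine: let Toys.pl fail for a bad sample but emit an empty placeholder output, so downstream rules still see all of their inputs. The `|| (mkdir -p ... && touch {output})` fallback is just my guess at one possible mechanism, not something I know to be the idiomatic Snakemake answer:

rule script:
    input: input_files
    output:
        'script_out/{sampleID}/{sampleID}.out.tsv'
    threads: 8
    params:
        Toys = config['Toys_dir'],
        db = config['Toys_db'],
    run:
        # The || only fires if Toys.pl itself fails: recreate the sample
        # directory (the rm -rf above may have removed it) and leave an
        # empty placeholder so this job "succeeds" and the DAG keeps moving.
        shell('export PATH={params.Toys}/samtools-0.1.19:$PATH; '
              'rm -rf script_out/{wildcards.sampleID}; '
              '{params.Toys}/Toys.pl '
              '-name {wildcards.sampleID} '
              '-o script_out/{wildcards.sampleID} '
              '-db {params.db} '
              '-p {threads} '
              '{input} '
              '|| (mkdir -p script_out/{wildcards.sampleID} && touch {output})')

If something like this is sane, build_script_table would also need to skip the zero-byte placeholders (pandas.read_csv raises EmptyDataError on a completely empty file), e.g. with an os.path.getsize(fname) == 0 check before reading. Whether this is the right way to let per-sample failures through, or whether Snakemake has a proper mechanism for it, is exactly what I'm asking.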
rule script_copy:
    input: rules.script.output
    output: 'script_calls/{sampleID}_out_filtered.tsv'
    run:
        shell('cp {input} {output}')
rule build_script_table:
    input: expand('script_calls/{sampleID}_out_filtered.tsv', sampleID=sampleIDs)
    output: 'tables/all_script.txt'
    params:
        span = config['length'],
    run:
        import pandas
        dfs = []
        for fname in input:
            df = pandas.read_csv(fname, sep='\t')
            if len(df) > 0:
                df['sampleID'] = fname.split('/')[-1].split('_')[0]
                df['Toyscript'] = 1
                # sorted_Match builds an order-independent key (helper shown below)
                df['Match'] = df.apply(lambda row: sorted_Match(row['ToyName1'], row['ToyName2']), axis=1)
                df['supporting_prices'] = df.spanningdates
                # combine fusions that are A|B and B|A
                df['total_price'] = df['supporting_prices'].groupby(df['Match']).transform('sum')
                # only keep the first row of each fusion now that support is summed
                df.drop_duplicates('Match', inplace=True)
                # remove fusions with too little support (the param is named span)
                df = df[df['total_price'] >= params.span]
                df.sort_values(by=['total_price'], ascending=False, inplace=True)
                # the fusion with the most support gets the highest score
                scores = list(range(len(df), 0, -1))
                df['script_rank'] = scores
                # fractional score per fusion, 1.0 being the top fusion
                df['script_score'] = df['script_rank'].apply(lambda x: float(x) / len(df))
                dfs.append(df)
        dfsc = pandas.concat(dfs)
        dfsc.to_csv(output[0], sep='\t', index=False)
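For completeness, sorted_Match is a small helper defined elsewhere in the Snakefile. I haven't pasted the exact version here, but it amounts to building an order-independent key from the two partner names, something along these lines:

def sorted_Match(name1, name2):
    # order-independent key so that A|B and B|A group together
    return '|'.join(sorted([name1, name2]))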