I plan to implement a pipeline in which I search for specific transcripts in three different genomes, align the best hits, and estimate some statistics for each of them.
To automate the task with Snakemake, I run blat against each genome sequence. At some point, however, the pipeline needs to use the transcripts produced by blat as inputs for the subsequent steps. The problem is that I don't know in advance which transcripts the checkpoint get_transcripts will produce (see below). Does someone know how to read the directory containing the transcripts and use its files as parallel inputs for the next steps? I tried to implement an input function that reads the path, but then the rule macse_align (see below) receives the whole list of files as its input, and Snakemake does not iterate over the transcript names but instead tries to execute the rule once with all the inputs at the same time.
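Concretely, with the input function shown below, macse_align becomes a single job whose {input} (and {output}) expand to every transcript at once, so the generated command looks roughly like this (transcript names are placeholders, and it assumes genomes/genes/ is already populated when the Snakefile is parsed):

java -Xmx8G -jar scripts/macse_v2.06.jar -prog alignSequences -seq genomes/genes/transcriptA/transcriptA.fa genomes/genes/transcriptB/transcriptB.fa -out_NT genomes/genes/transcriptA/transcriptA_NT.fa genomes/genes/transcriptB/transcriptB_NT.fa

What I would like instead is one macse run per transcript.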
I have checked similar posts, but the solutions usually use the list of files inside a directory as a single input list for the next rule (e.g. use directories or all files in directories as input in snakemake).
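For reference, the pattern those answers use is an aggregation step whose input function simply returns everything the checkpoint wrote, roughly like this sketch (the rule name, output file, and glob depth are placeholders I made up to illustrate the idea):

def aggregate_input(wildcards):
    # re-evaluated only after the checkpoint has run; returns one flat list of files
    checkpoint_output = checkpoints.get_transcripts.get(**wildcards).output[0]
    return glob.glob(checkpoint_output + '/*/*.fa')

rule aggregate:
    input:
        aggregate_input
    output:
        'genomes/genes_aggregated.txt'
    shell:
        'cat {input} > {output}'

That is fine when the next step really needs all the files together, but in my case each transcript has to be aligned separately.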
Here is my code
import os
import glob
configfile: 'config.yaml'
ROOT = os.path.abspath('genomes') + '/'
for d in ['pslx', 'info', 'genes', 'annotations']:
    os.makedirs(ROOT + d, exist_ok=True)

rule all:
    input:
        "genomes/genes/",
        [f + '/' + f.split('/')[-1] + '_NT.fa' for f in glob.glob('genomes/genes/*')],
        outgroups = expand(ROOT + config['FOCAL'] + '_{outgroup}.fa', outgroup=config['OUTGROUPS'])

rule parse_cds:
    input:
        ROOT + config['PREFIX'] + 'cds.all.fa.gz'
    output:
        ROOT + config['FOCAL'] + '_cds.fa'
    shell:
        """bioawk -c fastx '{{print ">"$name"\\n"$seq}}' {input} > {output}"""

rule pblat:
    input:
        cds = ROOT + config['FOCAL'] + '_cds.fa',
        genome = ROOT + "{outgroup}.fa.gz"
    output:
        ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}.pslx"
    threads:
        config['THREADS']
    shell:
        """
        pblat {input.genome} {input.cds} -t=dna -q=dna -minIdentity=60 -fine -threads={threads} -out=pslx {output}
        """

rule pslx_reps:
    input:
        ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}.pslx"
    output:
        pslx = ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}_reps.pslx",
        psr = ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}_reps.psr"
    shell:
        "pslReps -nohead {input} {output.pslx} {output.psr}"

rule pslx_info:
    input:
        ROOT + "pslx/" + "human_{outgroup}_reps.pslx"
    output:
        fa = ROOT + "human_{outgroup}.fa",
        info = ROOT + "info/" + "human_{outgroup}.info"
    shell:
        "perl scripts/pslx_to_fasta.pl --pslx {input} --fasta {output.fa} --info {output.info}"

checkpoint get_transcripts:
    input:
        outgroups = expand(ROOT + config['FOCAL'] + '_{outgroup}.fa', outgroup=config['OUTGROUPS']),
        infos = expand(ROOT + 'info/{outgroup}_back.info', outgroup=config['OUTGROUPS']),
        focal = ROOT + config['FOCAL'] + '_cds.fa'
    output:
        directory(ROOT + 'genes/')
    params:
        species_dict = config["OUTGROUPS"],
        distance = config['DISTANCE'],
        gtf = ROOT + config['GTF']
    script:
        'scripts/test.py'

# Input function for rule macse_align: return the paths to all transcript FASTAs produced by the checkpoint get_transcripts.
def list_transcripts(wildcards):
    checkpoint_output = checkpoints.get_transcripts.get(**wildcards).output[0]
    in_dir = glob.glob(checkpoint_output + '/*/*')
    return [i + "/" + i.split('/')[-1] + '.fa' for i in in_dir]

rule macse_align:
    input:
        list_transcripts
    output:
        [f + '/' + f.split('/')[-1] + '_NT.fa' for f in glob.glob('genomes/genes/*')]
    shell:
        """
        java -Xmx8G -jar scripts/macse_v2.06.jar -prog alignSequences -seq {input} -out_NT {output}
        """