0

I plan to implement a pipeline where I search for specific transcripts at three different genomes to align the best hits and estimate some statistics by each.

To automatize the task in Snakemake, I run blat given the genomes sequence. Nonetheless, at some point, the pipeline needs to use the output transcripts from blat as the subsequent inputs. The problem is that I don't know which transcripts will be output by the checkpoint get_transcripts (see below). Does someone know how to read the directory containing the transcripts and use them as parallel inputs for the next steps? I tried to implement a function to read the path, but then the rule macse_align (see bellow) gets as input a list of files, and Snakemake does not iterate by transcripts name but instead tries to execute the rule using all the input at once.

I have checked similar post, but the solution usually use the list of file inside a directory as an input list for the next rule (e.g: use directories or all files in directories as input in snakemake)

Here is my code


import os
import glob

configfile: 'config.yaml'

ROOT = os.path.abspath('genomes') + '/'

for d in ['pslx','info','genes','annotations']:
    os.makedirs(ROOT + d,exist_ok=True)

rule all:
    input:  
        outgroups = expand(ROOT+config['FOCAL'] + '_{outgroup}.fa',outgroup=config['OUTGROUPS']),
        "genomes/genes/",
        [f+'/'+f.split('/')[-1] + '_NT.fa' for f in glob.glob('genomes/genes/*')]



rule parse_cds:
    input:
        ROOT + config['PREFIX'] + 'cds.all.fa.gz'
    output:
        ROOT + config['FOCAL'] + '_cds.fa'
    shell:
        """bioawk -c fastx '{{print ">"$name"\\n"$seq}}' {input} > {output}"""

rule pblat:
    input:
        cds    = ROOT + config['FOCAL'] + '_cds.fa',
        genome = ROOT + "{outgroup}.fa.gz"
    output:
        ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}.pslx"
    threads:
        config['THREADS']
    shell:
        """
            pblat {input.genome} {input.cds} -t=dna -q=dna -minIdentity=60 -fine -threads={threads} -out=pslx {output}
        """

rule pslx_reps:
    input:
        ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}.pslx"
    output:
        pslx=ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}_reps.pslx",
        psr=ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}_reps.psr"

    shell:
        "pslReps -nohead {input} {output.pslx} {output.psr}"

rule pslx_info:
    input:
        ROOT + "pslx/" + "human_{outgroup}_reps.pslx"
    output:
        fa   = ROOT + "human_{outgroup}.fa",
        info = ROOT + "info/" + "human_{outgroup}.info"
    shell:
        "perl scripts/pslx_to_fasta.pl --pslx {input} --fasta {output.fa} --info {output.info}"

checkpoint get_transcripts:
    input:
        outgroups = expand(ROOT+config['FOCAL'] + '_{outgroup}.fa',outgroup=config['OUTGROUPS']),
        infos     = expand(ROOT+'info/{outgroup}_back.info',outgroup=config['OUTGROUPS']),
        focal     = ROOT + config['FOCAL'] + '_cds.fa',
    output:
        directory(ROOT+'genes/')
    params:
        species_dict=config["OUTGROUPS"],
        distance=config['DISTANCE'],
        gtf = ROOT + config['GTF']
    script:
        'scripts/test.py'


# input function for rule macse_align, return paths to all files produced by the checkpoint 'somestep'
def list_transcripts(wildcards):
    checkpoint_output = checkpoints.get_transcripts.get(**wildcards).output[0]
    in_dir = glob.glob(checkpoint_output+'/*/*')
    return [i + "/" + i.split('/')[-1] + '.fa' for i in in_dir]


rule macse_align:
    input:
        list_transcripts
    output:
        [f+'/'+f.split('/')[-1] + '_NT.fa' for f in glob.glob('genomes/genes/*')]
    shell:
        """ 
            java -Xmx8G -jar scripts/macse_v2.06.jar -prog alignSequences -seq {input} -out_NT {output}
        """
273K
  • 29,503
  • 10
  • 41
  • 64
  • Can you provide a simplified version of your problem? If I understand your correctly you want `rule macse_align` to be run the individual files returned from the `list_transcripts` function rather than using all of them as input at the same time? – euronion Jan 03 '23 at 12:58
  • Yes, that is precisely the problem. When executing `rule macse_align`, the shell command must execute each file list by the `list_transcript` function. However, `{input} wildcard` inputs any listed files into the the shell command, making it impossible to run and parallelize given the job number. Here is the a reduced version of the error command that cannot be executed: `java -Xmx8G -jar scripts/macse_v2.06.jar -prog alignSequences -seq ENST00000207636.fa ENST00000006251.fa ENST00000155674.fa ENST00000215742.fa --out_nt ENST00000207636_NT.fa ENST00000006251_NT.fa ENST00000155674_NT.fa` – Jesús Murga Moreno Jan 03 '23 at 13:54
  • Ok. So the mapping between input files and output files of `rule macse_align` is 1:1. You can rewrite your function `list_transcripts` to produce a list of files imitating the output of `rule macse_align`. Then setup a e.g. `rule macse_align_all` which takes this list as input. The input there should match the output/wildcard of `rule macse_align` and from those wildcards you can determine the input of the `rule macse_align`. In essence you create a rule which causes `macse_align` to be executed for each file returned from the checkpoint-function `list_transcripts`. – euronion Jan 03 '23 at 15:52

0 Answers0