
I'm writing a Snakemake pipeline for scRNA-seq data processing that uses STAR as the alignment tool.

Loading the genome index into memory for each alignment job is very time-consuming. Since I have a lot of memory at my disposal, I figure it is feasible to use STAR's "shared memory" feature: load the genome into memory once and keep it there until all the jobs are done.

With a plain shell script this is very easy to achieve, for example here.
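In outline, the shell version is something like this (a sketch; paths and file naming are placeholders, and the --genomeLoad values are STAR's documented shared-memory options):

#!/usr/bin/env bash
# Load the genome into shared memory once, then exit.
STAR --genomeLoad LoadAndExit --genomeDir /path/to/genome

# Run all alignment jobs against the already-loaded genome.
for fq in *_R1_extracted.fq.gz; do
    STAR --runThreadN 32 \
         --genomeLoad LoadAndKeep \
         --genomeDir /path/to/genome \
         --readFilesIn "$fq" \
         --readFilesCommand zcat \
         --limitBAMsortRAM 20000000000 \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix "${fq%_R1_extracted.fq.gz}_"
done

# Remove the genome from shared memory once all jobs are done.
STAR --genomeLoad Remove --genomeDir /path/to/genome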

But chaining it into a Snakemake pipeline seems a non-trivial task, especially since the STAR shared-memory load/unload steps don't take any input or produce any output. For example, I tried:

# Not full pipeline
rule umi_tools_extract:
    input:
        read1="{sample}_R1.fq.gz",
        read2="{sample}_R2.fq.gz",
        whitelist="whitelist.txt"
    output:
        "{sample}_R1_extracted.fq.gz"
    shell:
        """
        umi_tools extract --bc-pattern=CCCCNNNN \
                          --stdin {input.read1} \
                          --read2-in {input.read2} \
                          --stdout {output} \
                          --read2-stdout --filter-cell-barcode --error-correct-cell \
                          --whitelist={input.whitelist}
        """

rule STAR:
    input:
        fq="{sample}_R1_extracted.fq.gz",
        genomeDir="/path/to/genome"
    output:
        "{sample}_Aligned.sortedByCoord.out.bam"
    threads: 32
    shell:
        """
        STAR --runThreadN {threads} \
             --genomeLoad LoadAndKeep \
             --genomeDir {input.genomeDir} \
             --readFilesIn {input.fq} \
             --readFilesCommand zcat \
             --limitBAMsortRAM 20000000000 \
             --outFilterMultimapNmax 1 \
             --outFilterType BySJout \
             --outSAMstrandField intronMotif \
             --outFilterIntronMotifs RemoveNoncanonical \
             --outFilterMismatchNmax 6 \
             --outSAMtype BAM SortedByCoordinate \
             --outFileNamePrefix {wildcards.sample}_
        """
... ...

rule STAR_unload:
    input:
        genomeDir="/path/to/genome"
#    output:  ## Put something here??
#        ""
    shell:
        """
        STAR --genomeLoad Remove \
             --genomeDir {input.genomeDir}
        """

But obviously this wouldn't work. I'm thinking of putting a dummy output in the STAR_unload rule and feeding it into the rules downstream of STAR, so that STAR_unload is triggered only after all the alignment jobs are done.

Any help would be appreciated!

  • This question should probably be moved to https://bioinformatics.stackexchange.com/! – Konrad Rudolph Nov 09 '21 at 07:27
  • Hi @KonradRudolph Before submitting the question, I did a keyword search with "STAR shared memory snakemake" on both SO and Bioinformatics StackExchange and found SO has more related activities, so I posted it here :-) Thanks for your suggestion! – Ruiyu Ray Wang Nov 10 '21 at 08:28
  • Such a search is probably misleading: Stack Overflow has vastly more contents than the Bioinformatics site, but (a) much of that content is historical and (b) your question will be drowned in a sea of other topics, whereas Bioinformatics has more focus on such questions. So the chance of getting a good answer is a lot higher on the latter, even though it has vastly less overall activity. Anyway, glad you found a solution. – Konrad Rudolph Nov 10 '21 at 09:21

2 Answers


the STAR "shared memory" module doesn't require any input or output anything

I'm not familiar with STAR and the shared-memory option, but the input/output issue can be easily resolved using dummy "flag" files that signal a step is done. E.g.:

rule index_genome:
    input:
        fa= 'genome.fa',
    output:
        done= touch('index.done'),
    shell:
        r"""
        STAR --runMode genomeGenerate --genomeFastaFiles {input.fa} --genomeDir ...
        """

rule load_genome:
    input:
        'index.done',
    output:
        touch('loading.done'),
    shell:
        r"""
        STAR --genomeLoad LoadAndExit --genomeDir ...
        """

rule align:
    input:
        fq= 'reads.fastq',
        idx= 'loading.done',
    output:
        ...
    shell:
        r"""
        STAR ...
        """

rule unload_genome:
    # Delete the loading.done flag file otherwise subsequent runs of the pipeline
    # will fail to load the genome again if STAR alignment is needed.
    input:
        # Generic function that aggregates all alignment outputs
        bams= get_files(),
        idx= 'loading.done',
    params:
        # The genome directory is a parameter rather than an input, since it
        # is not a file that Snakemake needs to track.
        genomeDir= '/path/to/genome',
    output:
        "logs/STARunload_Log.out"
    shell:
        r"""
        STAR --genomeLoad Remove \
             --genomeDir {params.genomeDir} \
             --outFileNamePrefix logs/STARunload_
        rm {input.idx}
        """
– dariober

I found a workaround similar to @dariober's solution, using dummy files.

Using the --outFileNamePrefix flag in STAR, I can create a dummy log file during genome unload. This is not as clean as dariober's answer, though, because STAR automatically creates three files: Log.final.out, Log.out, and Log.progress.out.

# Unload STAR genome index
rule STAR_unload:
    input:
        # Generic function that aggregates all alignment outputs
        bams=get_files(),
        genomeDir="/path/to/index"
    output:
        "logs/STARunload_Log.out"
    shell:
        """
        STAR --genomeLoad Remove \
             --genomeDir {input.genomeDir} \
             --outFileNamePrefix logs/STARunload_
        """

rule some_downstream_step:
    input:
        bam="Aligned.sortedByCoord.out.bam",
        dummy="logs/STARunload_Log.out"
    output:
        "genes_assigned.bam"
    shell:
        """
        ... ...
        """
  • Note that when you unload the genome you should also delete the `loading.done` flag file. I've edited my answer to add rule `unload_genome`. – dariober Nov 11 '21 at 08:56
  • Yes, I suppose I could also mark the loading.done flag file as temporary? i.e. `temp(touch('loading.done'))`. – Ruiyu Ray Wang Nov 12 '21 at 04:31
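(For reference, a sketch of the load_genome rule with the temp(touch()) change from the last comment, untested:)

rule load_genome:
    input:
        'index.done',
    output:
        # touch() creates the flag file; temp() removes it once every rule
        # that consumes it has finished, so the explicit `rm {input.idx}`
        # in unload_genome becomes unnecessary.
        temp(touch('loading.done')),
    shell:
        r"""
        STAR --genomeLoad LoadAndExit --genomeDir ...
        """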