0

I want to use snakemake to do bowtie2 mapping of split read files to a reference genome, and I'd like that rule to be integrated in the general workflow. For that purpose, I first defined a rule to create a bowtie index

rule build_bowtie_index:
    input:
        referenceGenomeFasta
    output:
        expand("{name}.{index}.bt2", index=range(1,5), name = referenceGenomeName),
        expand("{name}.rev.{index}.bt2", index=range(1,3), name = referenceGenomeName)
    params:
        btindex = referenceGenomeName
    shell:
        "bowtie2-build {input} {params.btindex}"

referenceGenomeFasta is reference genome fasta file, while referenceGenomeName is the same file without .fna suffix. The goal was to take foo.fna reference genome file and produce a list of index files foo.1.bt2, foo.2.bt2., foo.rev.bt1 etc. because bowtie2 requires that index (-X flag) be the name of the folder up to final dot, so I figured I'd store the referenceGenomeName as a global configurable variable.

Now I have a list of files containing split reads in the form of sampleFoo_1.fastq,sampleFoo_2.fastq, sampleBar_1.fastq,sampleBar_2.fastq etc. When I used bash scripts for mapping, I used the command:

The rule for creating bowtie2 index is the following:

bowtie2 -x {myindex} -1 {sample}_1.fastq -2 {sample}_2.fastq | samtools view -bS > {sample}.aligned.bam

I also used parallel processing via Sun Grid Engine to speed up the computation as I have a lot of samples. Now I want to transfer the workflow into snakemake. My rule for bowtie2 mapping must include my samples and the outputs of build_bowtie_index rule. The total thing looks like this:

SAMPLES = ["Foo", "Bar"]
rule align_via_bowtie2: 
    input:
        firstPair=expand("{rawSampleFolder}/{samples}_1.fastq",samples=SAMPLES, rawSampleFolder=rawSampleFolder),
        secondPair=expand("{rawSampleFolder}/{samples}_2.fastq",samples=SAMPLES, rawSampleFolder=rawSampleFolder),
        btindex1=expand("{name}.{index}.bt2", index=range(1,5), name = referenceGenomeName),
        btindex2=expand("{name}.rev.{index}.bt2", index=range(1,3), name = referenceGenomeName)
    params:
        btindex = referenceGenomeName
    output:
        expand("{proccSampleFolder}/{samples}.aligned.bam", samples=SAMPLES, proccSampleFolder=proccSampleFolder)
    shell:
        "bowtie2 -x {params.btindex} -1 {input.firstPair} -2 {input.secondPair} | samtools view -bS -> {output}"

When I run this script with SAMPLES=["Foo"] the script works perfectly, and what's run is:

 bowtie2 -x ref_genome -1 raw_samples/Foo_1.fastq -2 raw_samples/Foo_2.fastq | samtools view -bS -> proccessed_samples/Foo.aligned.bam

But when I run the script with SAMPLES=["Foo","Bar"], what snakemake executes is:

 bowtie2 -x ref_genome -1 raw_samples/Foo_1.fastq raw_samples/Bar_1.fastq -2 raw_samples/Foo_2.fastq raw_samples/Bar_2.fastq | samtools view -bS -> proccessed_samples/Foo.aligned.bam proccessed_samples/Bar.aligned.bam

In other words, instead of mapping first Foo, and Bar second, it maps Foo Bar and I get an error. My problem is that I want to process a list of files as a separate job - or two separate lists of files containing split reads. What am I doing wrong?

ivan199415
  • 375
  • 3
  • 12
  • 1
    Does this answer your question? [Snakemake: expand params](https://stackoverflow.com/questions/60173518/snakemake-expand-params) – KeyboardCat Feb 15 '22 at 07:38

1 Answers1

0

This is duplicated from biostars, and since I've posted an answer is there, I'm linking it here as well.

ivan199415
  • 375
  • 3
  • 12