Snakemake: HISAT2 alignment of many RNAseq reads against many genomes UPDATED

Question

I have several genome files with suffix .1.ht2l to .8.ht2l

bob.1.ht2l
...
bob.8.ht2l
steve.1.ht2l 
...
steve.8.ht2l

and sereval RNAseq samples

flower_kevin_1.fastq.gz
flower_kevin_2.fastq.gz
flower_daniel_1.fastq.gz
flower_daniel_2.fastq.gz

I need to align all rnaseq reads against each genome. UPDATED as dariober suggested:

workdir: "/path/to/aligned"  
(HISAT2_INDEX_PREFIX,)=glob_wildcards("/path/to/index/{prefix}.1.ht2l")
(SAMPLES,)=glob_wildcards("/path/to/{sample}_1.fastq.gz") 
print(HISAT2_INDEX_PREFIX)  
print (SAMPLES)

rule all:
    input: 
        expand("{prefix}.{sample}.bam", zip, prefix=HISAT2_INDEX_PREFIX, sample=SAMPLES)

rule hisat2:
    input:
        hisat2_index=expand("%s.{ix}.ht2l" % "/path/to/index/{prefix}", ix=range(1, 9), prefix = HISAT2_INDEX_PREFIX),
        fastq1="/path/to/{sample}_1.fastq.gz",
        fastq2="/path/to/{sample}_2.fastq.gz"
    output:
        bam = "{prefix}.{sample}.bam",
        txt = "{prefix}.{sample}.txt",
    log: "{prefix}.{sample}.snakemake_log.txt"
    threads: 5
    shell:
      "/Tools/hisat2-2.1.0/hisat2 -p {threads} -x {/path/to/index/{wildcards.prefix}"
      " -1 {input.fastq1} -2 {input.fastq2}  --summary-file {output.txt} |"
      "/Tools/samtools-1.9/samtools sort -@ {threads} -o {output.bam}"

The problem I get is when running HISAT2 is taking as -x input all bob.1.ht2l:bob.8.ht2l and steve.1.ht2l:steve.8.ht2l at once. While rna-seq should be mapped at each genome separately. Where is the error? NB: my previous question: Snakemake: HISAT2 alignment of many RNAseq reads against many genomes

score 1 · Accepted Answer · answered Jan 15 '20 at 16:17

1

I think your confusion comes from the fact that hisat wants a prefix to the index files, not all the list of index files. So instead of -x {input.hisat2_index} (i.e. the list of index files) use something like -x /path/to/{wildcards.prefix}.

In other words, the input hisat2_index=expand(...) should be there only to tell snakemake to start this rule only after these files are ready but you don't use them directly (well, hisat does use them of course but you don't pass them on the command line).

answered Jan 15 '20 at 16:17

dariober

8,240
3
30
47

Hi dariober, thanks a lot for a suggestion, I still get all genomes in the input, is it normal, would it work anyway? something like this: `rule hisat2: input: bob.1.ht2l...bob.8.ht2l, steve.1.ht2l...steve.8.ht2l output: bob.flower_kevin.bam, bob.flower_kevin.txt log: bob.flower_kevin.snakemake_log.txt jobid: 2 wildcards: prefix=bob, sample=flower_kevin` – user3224522 Jan 16 '20 at 08:50
@user3224522 can you update your post to show the code you have tried? It's difficult to tell what's wrong otherwise – dariober Jan 16 '20 at 11:19
sure, I updated, I added `-x {/path/to/index/{wildcards.prefix}` as you suggested – user3224522 Jan 16 '20 at 11:27
1

You have an extra `{`. Instead of `-x {/path/...` you should have `-x /path/to/index/{wildcards.prefix}`. Apart from that, it may be good. Just try to run it and see. Run snakemake with `-p` (to print shell commands) and `-n` (for dry run) – dariober Jan 16 '20 at 12:02
yes, sorry, there is no extra `{` (typing error), I run it as you wrote above..anyway when I run I can see the list of all genomes in input section, but we have wildcards it should run for all separately right? – user3224522 Jan 16 '20 at 12:34

Snakemake: HISAT2 alignment of many RNAseq reads against many genomes UPDATED

1 Answers1

Linked