0

I have several genome files, all with suffix .1.ht2l to .8.ht2l

bob.1.ht2l
bob.2.ht2l
bob.3.ht2l
bob.4.ht2l 
bob.5.ht2l
bob.6.ht2l
bob.7.ht2l
bob.8.ht2l
steve.1.ht2l ....steve.8.ht2l and so on

and sereval RNAseq samples, like

flower_kevin_1.fastq.gz
flower_kevin_2.fastq.gz
flower_daniel_1.fastq.gz
flower_daniel_2.fastq.gz and so on also with different tissues.

I would like to align all rnaseq reds against the genomes. UPDATED:

workdir: "/path/to/aligned"  
(HISAT2_INDEX_PREFIX,)=glob_wildcards("/path/to/index/{prefix}.1.ht2l")
(SAMPLES,)=glob_wildcards("/path/to/{sample}_1.fastq.gz") 
print(HISAT2_INDEX_PREFIX)  
print (SAMPLES)

rule all:
    input: 
        expand("{prefix}.{sample}.bam", zip, prefix=HISAT2_INDEX_PREFIX, sample=SAMPLES)

rule hisat2:
    input:
        hisat2_index=expand("%s.{ix}.ht2l" % "/path/to/index/{prefix}", ix=range(1, 9)),
        fastq1="/path/to/{sample}_1.fastq.gz",
        fastq2="/path/to/{sample}_2.fastq.gz"
    output:
        bam = "{prefix}.{sample}.bam",
        txt = "{prefix}.{sample}.txt",
    log: "{prefix}.{sample}.snakemake_log.txt"
    threads: 5
    shell:
      "/Tools/hisat2-2.1.0/hisat2 -p {threads} -x {HISAT2_INDEX_PREFIX}"
      " -1 {input.fastq1} -2 {input.fastq2}  --summary-file {output.txt} |"
      "/Tools/samtools-1.9/samtools sort -@ {threads} -o {output.bam}"

I get missing input files error, I am sure the error is somewhere with suffixes, however,I do not know how to solve and any suggestion is much appreciated. UPDATE: ALL genome files bob.1.ht2l..bob.8.ht2l, steve.1.ht2l..steve.8.ht2l get invoked all at once, why is that?

user3224522
  • 1,119
  • 8
  • 19

1 Answers1

2

Let's compare these 2 lines:

(HISAT2_INDEX_PREFIX,)=glob_wildcards("/path/to/index/{prefix}.1.ht2l")
        hisat2_index=expand("%s.{ix}.ht2l" % HISAT2_INDEX_PREFIX, ix=range(1, 9)),

The first line explicitly specifies the /path/to/index/, the second misses this path. Be consistent, that should fix your problem.

Dmitry Kuzminov
  • 6,180
  • 6
  • 18
  • 40
  • thanks, it is still taking all genomes.{ix}.ht2l all as once as input....what else could be wrong there? – user3224522 Dec 26 '19 at 14:52
  • @user3224522 Please provide the updated code. Next the results of all the `glob_wildcards` invocations and actual value of `hisat2_index=expand("%s.{ix}.ht2l" % HISAT2_INDEX_PREFIX, ix=range(1, 9))` should help to identify the issue. We don't see your filesystem, provide us more info. – Dmitry Kuzminov Dec 26 '19 at 23:07
  • I updated the question, please let me know if it is not clear, i will add more info. – user3224522 Dec 29 '19 at 14:33
  • @user3224522 What do you mean by "ALL genome files ... get invoked all at once"? Anyway I would strongly recommend to replace this old style string operation `"%s.{ix}.ht2l" % "/path/to/index/{prefix}"` with something newer newer like f-string to improve the readability. Snakemake treats the constructions in braces differently depending on context. – Dmitry Kuzminov Dec 31 '19 at 05:43
  • I am using old style, because the python installed on that pc is < 3.6. I am still struggling to get the result. – user3224522 Jan 15 '20 at 13:05
  • @user3224522 You ignored the first part of my question. I would recommend you to show here the **exact** file structure (e.g. using the `tree` command). Next, you can get rid of all the unnecessary bioinformatics details and use the script in form `touch {output}`. As long as you get the missing input exception, the actual scripts are irrelevant. Try to provide a minimal example with just `touch`, `cp`, `cat` shell scripts. – Dmitry Kuzminov Jan 15 '20 at 21:03