0

I'm new to building Snakefiles and for my bioinformatics research, I'm trying to loop my rules over multiple samples. I looked for similar questions and answers, but I can't seem to fix this problem. It may be because I still don't really understand how Snakemake works exactly. If you guys can help me out that would be great.

At the moment I have multiple rules, which currently works for one sample:

# variables for every species
SAMPLE = "SRR8528338"
SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()

configfile: "./envs/contigs/" + SAMPLE + ".yaml"

var_variables = expand("results/4_mapped_contigs/" + SAMPLE + "/var/Contig{nr}_AT_sort.var", nr = config["contig_nrs"])
#make_contig_consensus = expand("results/5_consensus_contigs/{sample}", sample = SAMPLES)

rule all:
    input:
         var_variables
#        make_contig_consensus

rule convert_to_fBAM:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/sam/Contig{nr}_AT.sam"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/bam/Contig{nr}_AT.bam"
    shell:
        "samtools view -bS {input} > {output}"

rule sort_fBAM:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/bam/Contig{nr}_AT.bam"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/sorted_bam/Contig{nr}_AT_sort.bam"
    shell:
        "samtools sort -m5G {input} -o {output}"

rule convert_to_fpileup:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/sorted_bam/Contig{nr}_AT_sort.bam"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/pileup/Contig{nr}_AT_sort.pileup"
    shell:
        "samtools mpileup -B {input} > {output}"

rule SNP_calling:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/pileup/Contig{nr}_AT_sort.pileup"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/var/Contig{nr}_AT_sort.var"
    shell:
        "varscan pileup2cns {input} "
        "--min-freq-for-hom 0.6 "
        "--min-coverage 5 "
        "--min-var-freq 0.6 "
        "--p-value 0.1 "
        "--min-reads2 5 "
        "> {output}"

rule make_contig_consensus:
    input:
        "src/read_var.py"
    output:
        "results/5_consensus_contigs/{sample}"
    params:
        "{sample}"
    shell:
        "python3 {input} {params}"

The config file differs for every sample (the numbers of contigs). For SRR8528338, it looks like this:

contig_nrs:
    1: ./results/4_mapped_contigs/SRR8528338/var/Contig1_AT_sort.var
    2: ./results/4_mapped_contigs/SRR8528338/var/Contig2_AT_sort.var
    3: ./results/4_mapped_contigs/SRR8528338/var/Contig3_AT_sort.var
    ...
    2146: ./results/4_mapped_contigs/SRR8528338/var/Contig2146_AT_sort.var 

However, I want to loop all these rules over multiple samples as referred to in the "SAMPLES" variable. Now I tried using double braces before, which worked for multiple samples. (Changing all 'SAMPLES' to {{sample}} and adding: , sample = SAMPLES). Then my code should be looking like this:

# variables for every species
SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()

for sample in SAMPLES:
    configfile: "./envs/contigs/" + sample + ".yaml"

var_variables = expand("results/4_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var", sample = SAMPLES, nr = config["contig_nrs"])
make_contig_consensus = expand("results/5_consensus_contigs/{sample}", sample = SAMPLES)

rule all:
    input:
         var_variables
#        make_contig_consensus

rule convert_to_fBAM:
    input:
        "results/4_mapped_contigs/{{sample}}/sam/Contig{nr}_AT.sam"
    output:
        "results/4_mapped_contigs/{{sample}}/bam/Contig{nr}_AT.bam"
    shell:
        "samtools view -bS {input} > {output}"

rule sort_fBAM:
    input:
        "results/4_mapped_contigs/{{sample}}/bam/Contig{nr}_AT.bam"
    output:
        "results/4_mapped_contigs/{{sample}}/sorted_bam/Contig{nr}_AT_sort.bam"
    shell:
        "samtools sort -m5G {input} -o {output}"

rule convert_to_fpileup:
    input:
        "results/4_mapped_contigs/{{sample}}/sorted_bam/Contig{nr}_AT_sort.bam"
    output:
        "results/4_mapped_contigs/{{sample}}/pileup/Contig{nr}_AT_sort.pileup"
    shell:
        "samtools mpileup -B {input} > {output}"

rule SNP_calling:
    input:
        "results/4_mapped_contigs/{{sample}}/pileup/Contig{nr}_AT_sort.pileup"
    output:
        "results/4_mapped_contigs/{{sample}}/var/Contig{nr}_AT_sort.var"
    shell:
        "varscan pileup2cns {input} "
        "--min-freq-for-hom 0.6 "
        "--min-coverage 5 "
        "--min-var-freq 0.6 "
        "--p-value 0.1 "
        "--min-reads2 5 "
        "> {output}"

rule make_contig_consensus:
    input:
        "src/read_var.py"
    output:
        "results/5_consensus_contigs/{sample}"
    params:
        "{sample}"
    shell:
        "python3 {input} {params}"

However, when I run this I get an error. I'm not exactly sure, but I think it is because of the for loop (sample in SAMPLES):

Missing input files for rule all:
results/4_mapped_contigs/SRR8528338/var/Contig1266_AT_sort.var
results/4_mapped_contigs/SRR8528338/var/Contig1299_AT_sort.var
...

Now I was wondering: is there a way to expand the config file by using wildcards? Something like:

configfile: expand("./envs/contigs/{sample}.yaml", sample = SAMPLES)

Doing this will give me the error:

TypeError in line 4
expected str, bytes or os.PathLike object, not list

or do you have other solutions for this problem?

Thank you!


Update:

I've been trying some things out and I think it would be useful to change the config file into a nested dictionary instead of separate ones. It should look something like this:

    contigs:
         SRR8528336: - 1
                     - 2
                     - ...
                     - 2113
         SRR8528337: - 1
                      ...
          ...
    exons:
         SRR8528336: - 1
                      ...
                     - 1827
         SRR8528337: - 1
                       ...
                     - 1826
          ...

So for example, if I want to run for the samples: SRR8528338 until SRR8528340 I give this as input:

SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()

and call the contigs by sample name:

var_variables = expand("results/4_mapped_contigs/{{sample}}/var/Contig{nr}_AT_sort.var", nr = config["contigs"][wildcards.sample])

or exons by:

expand("results/7_exons/{{sample}}/var/exon{nr}_AT_sort.var", nr = config["exons"][wildcards.sample])

How does the 'wildcards.sample' works exactly if I only want to obtain the value?


Solution (and next problem) 31/7/2020

I made my changes according to bli, which is now working now:

# variables for every species
SAMPLES = "SRR8528347 SRR8528355 SRR8528356".split()

configfile: "./envs/config_contigs.yaml"

# bam = []
# sort_bam = []
# fpileup = []
var_variables = []
make_contig_consensus = []
blat_variables = []
extract_hits_psl = []
for sample in SAMPLES:
    contig_nrs = config[sample]
    for nr in contig_nrs:
        # bam.append("results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam".format(sample=sample, nr=nr))
        # sort_bam.append("results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam".format(sample=sample, nr=nr))
        # fpileup.append("results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup".format(sample=sample, nr=nr))
        var_variables.append("results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var".format(sample=sample, nr=nr))
        make_contig_consensus.append("results/A05_consensus_contigs/{sample}/Contig{nr}.fasta".format(sample=sample, nr=nr))
        blat_variables.append("results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl".format(sample=sample, nr=nr))
        extract_hits_psl.append("results/A07_mapped_exons/{sample}/".format(sample=sample, nr=nr))

rule all:
    input:
        # bam,
        # sort_bam,
        # fpileup,
        var_variables,
        make_contig_consensus,
        blat_variables,
        extract_hits_psl

rule convert_to_fBAM:
    input:
        "results/A04_mapped_contigs/{sample}/sam/Contig{nr}_AT.sam"
    output:
        "results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam"
    shell:
        "samtools view -bS {input} > {output}"

rule sort_fBAM:
    input:
        "results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam"
    output:
        "results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam"
    shell:
        "samtools sort -m5G {input} -o {output}"

rule convert_to_fpileup:
    input:
        "results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam"
    output:
        "results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup"
    shell:
        "samtools mpileup -B {input} > {output}"

rule SNP_calling:
    input:
        "results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup"
    output:
        "results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var"
    shell:
        "varscan pileup2cns {input} "
        "--min-freq-for-hom 0.6 "
        "--min-coverage 5 "
        "--min-var-freq 0.6 "
        "--p-value 0.1 "
        "--min-reads2 5 "
        "> {output}"

rule make_contig_consensus:
    input:
        script = "src/read_var.py",
        file = "results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var"
    output:
        "results/A05_consensus_contigs/{sample}/Contig{nr}.fasta"
    params:
        "{sample}"
    shell:
        "python3 {input.script} {params}"

rule BLAT_assembled:
    input:
        "data/exons/exons_AT.fasta",
        "results/A05_consensus_contigs/{sample}/Contig{nr}.fasta"
    output:
        "results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl"
    shell:
        "blat "
        "-t=dnax "
        "-q=dnax "
        "-stepSize=5 "
        "-repMatch=2253 "
        "-minScore=0 "
        "-minIdentity=0 "
        "{input} {output}"

rule extract_hits_psl:
    input:
        script = "src/extract_hits_psl.py"
        # file = "results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl"
    output:
        "results/A07_mapped_exons/{sample}/"
    params:
        "{sample}"
    shell:
        "python {input.script} {params}"

config_contigs.yaml:

SRR8528347:
    - 1
    - ...
    - 5
SRR8528348:
    - 1
    - ...
    - 5
...

Now calling them from the .yaml is working, but the rules should be run in the same order as written (from top to bottom). When running this, the rules are run in a different order and therefore gives an error because the files don't exist yet. I read that the output of the order before should be the same as the input after, but it is not working.

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
Elboo
  • 1
  • 1
  • Regarding your last error: `expand` creates a list of strings, but `configfile` is supposed to be the path to a file (that can be represented as a string), not a list of strings. If you actually have one configuration file per sample, that means that your workflow is meant to process only one sample at a time. I wrote this answer where I tried to explain various useful things to understand how snakemake works: https://stackoverflow.com/questions/50198628/snakemake-confusion-on-how-to-access-config-files-properly/50216057#50216057 Maybe it can be useful for you. – bli Jul 06 '20 at 07:31
  • Thank you! I looked at it and did help my understanding a bit more, but I'm still stuck. I see you use something like 'config[sample][wildcards.sample]'. I think if I can obtain my values from the config file like this, it would help. But I don't understand how to get these values (see edited post above). – Elboo Jul 23 '20 at 09:44
  • I'm not sure to understand your problem regarding execution order. Could you be more specific? In Snakemake, execution order will be determined based on explicit dependencies between rules (i.e. in terms input and output sections). If there is no such dependency, you have no guarantee on the execution order. – bli Jul 31 '20 at 10:24
  • So when I run this snakefile, it will run the rule convert_to_fBAM until convert_to_fpileup. But instead of running SNP_calling afterwards, it skips this rule and goes to the next rule 'make_contig_consensus'. An error will then occur, because "results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var" does not exist. I added my current code to the main question (31/07/2020) – Elboo Jul 31 '20 at 11:24
  • That's strange. As much as I can see, the input of `make_contig_consensus` depends on the output of `SNP_calling`, so Snakemake should not skip directly to `make_contig_consensus`. Can you add the exact error messages in your post? – bli Jul 31 '20 at 15:11

1 Answers1

0

Here is an attempt at answering the question raised in your last edit (23/07/2020).

If your configuration file is as you show, you should have a config Python dict that contains a "contigs" entry, which is in turn a dict that can associate lists of numbers to sample identifiers.

You also generate a list of sample identifiers, and store it in the list SAMPLES.

To "drive" the pipeline, we decide which files will be the final goal, and that will be given as input of the all rule. Here, you want to generate a list of output files that follow this pattern:

"results/4_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var"

where the {sample} placeholders will take the values in the SAMPLE list, and where the {nr} placeholders will take values depending on the value of the {sample} placeholder.

(Note: I use "placeholder" instead of "wildcard" to help making it clear that there is no "magic" involved with the snakemake wildcards object here. The wildcards objects are defined in each rule instance, but here we are just preparing lists of files, in "plain" Python, outside of rules.)

Let's do it in plain Python, not using expand.

Start with an empty list:

var_variables = []

Now make a main loop for the samples:

for sample in SAMPLES:
    # Note: No `wildcards` here. We are not in a rule.
    # `sample` is just a (loop) variable in Python,
    # referring to one of the character strings in `SAMPLES`
    contig_nrs = config["contigs"][sample]
    for nr in contig_nrs:
        # I'm using python 3.6+ "formatted strings" to have the placeholders
        # replaced with the corresponding variable's values
        var_variables.append(f"results/4_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var")

And now, after this double loop is done, we can provide this list of files to the driving rule:

rule all:
    input:
         var_variables

There could be ways to use nested expands and double curly braces syntax somewhere ({{}}) to do the same thing, but if you know some basics of Python (lists, dicts, strings...), you can do without it, and it may be simpler to understand. (expand is an utility provided by Snakemake, it doesn't exist in standard Python.)

As for the wildcards objects, these will exist only inside rules, and the existing attributes will be determined by matching the pattern of the output section with the name of the actual file that the rule instance will have to generate.

For more on this, you can have a look at this proposal I had made to improve Snakemake's documentation: https://bitbucket.org/snakemake/snakemake/pull-requests/307/documentation-to-help-understand-expand/diff

bli
  • 7,549
  • 7
  • 48
  • 94
  • Thank you so much for helping me out. It worked with the main loop. Although, this works on my laptop. But when I try to run this script on a ssh server (by git clone the exact same script), I get an 'invalid syntax' error on the 'expand' line, which disappears when I delete the 'f' from the expand(f"...") line. Although, if I delete this, the line won't work as a wildcard anymore. Have you any idea how to solve this? – Elboo Jul 30 '20 at 08:10
  • @Elboo The invalid syntax is likely due to the version of Python being lower than 3.6 on the server. Instead of prefixing the string with "f", use the `.format` method: `"results/4_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var".format(sample=sample, nr=nr)` – bli Jul 30 '20 at 09:33
  • I have another question actually. Now it's working fine with calling them via de config files, but the rules should be run in the correct order. However, it's not doing that anymore. My script is posted in the main question. I know that the output of the rule before should be the same as the input of the next rule to make them run in the correct order, but it's not working. – Elboo Jul 31 '20 at 09:59