5

I am very new to snakemake and also not so fluent in python (so apologies this might be a very basic stupid question):

I am currently building a pipeline to analyze a set of bamfiles with atlas. These bamfiles are located in different folders and should not be moved to a common one. Therefore I decided to provide a samplelist looking like this (this is just an example, in reality samples might be on totaly different drives):

Sample     Path
Sample1    /some/path/to/my/sample/
Sample2    /some/different/path/

And load it in my config.yaml with:

sample_file: /path/to/samplelist/samplslist.txt

Now to my Snakefile:

import pandas as pd

#define configfile with paths etc.
configfile: "config.yaml"

#read-in dataframe and define Sample and Path
SAMPLES = pd.read_table(config["sample_file"])
BAMFILE = SAMPLES["Sample"]
PATH = SAMPLES["Path"]

rule all:
    input:
        expand("{path}{sample}.summary.txt", zip, path=PATH, sample=BAMFILE)

#this works like a charm as long as I give the zip-function in the rules 'all' and 'summary':

rule indexBam:
    input: 
        "{path}{sample}.bam"
    output:
        "{path}{sample}.bam.bai"
    shell:
        "samtools index {input}"

#this following command works as long as I give the specific folder for a sample instead of {path}.
rule bamdiagnostics:
    input:
        bam="{path}{sample}.bam",
        bai=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE)
    params:
        prefix="analysis/BAMDiagnostics/{sample}"   
    output:
        "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        "analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
        "analysis/BAMDiagnostics/{sample}_MQ.txt",
        "analysis/BAMDiagnostics/{sample}_readLength.txt",
        "analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
    message:
        "running BamDiagnostics...{wildcards.sample}"
    shell:
        "{config[atlas]} task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"

rule summary:
    input:
        index=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE),
        bamd=expand("analysis/BAMDiagnostics/{sample}_approximateDepth.txt", sample=BAMFILE)
    output:
        "{path}{sample}.summary.txt"
    shell:
        "echo -e '{input.index} {input.bamd}"

I get the error

WildcardError in line 28 of path/to/my/Snakefile: Wildcards in input files cannot be determined from output files: 'path'

Can anyone help me?
- I tried to solve this problem with join, or creating input functions but I think I am just not skilled enough to see my error...
- I guess the problem is, that my summary-rule does not contain the tuplet with the {path} for the bamdiagnostics-output (since the output is somewhere else) and cannot make the connection to the input file or so...
- Expanding my input on bamdiagnostics-rule makes the code work, but of course takes every samples input to every samples output and creates a big mess: In this case, both bamfiles are used for the creation of each outputfile. This is wrong as the samples AND the output are to be treated independently.

Manavalan Gajapathy
  • 3,900
  • 2
  • 20
  • 43
Trillian Astra
  • 129
  • 1
  • 10
  • so what's stopping you from applying `expand` for bam's `{path}` in `rule bamdiagnostics` similar to bai's `{path}`. That's what the error obtained is referring to. – Manavalan Gajapathy Mar 20 '18 at 18:19
  • since I only need the "bai" part to appear in the input (without actually using it), I can state the expand function here. But the bamfiles are to be treated independently. Long answer under your answer below ;) – Trillian Astra Mar 20 '18 at 23:09

1 Answers1

5

Based on the atlas doc, it seems like what you need is to run each rule separately for each sample, the complication here being that each sample is in separate path.

I modified your script to work for above case (see DAG). Variables in the beginning of script were modified to make better sense. config was removed for demo purposes, and pathlib library was used (instead of os.path.join). pathlib is not necessary, but it helps me keep sanity. A shell command was modified to avoid config.

import pandas as pd
from pathlib import Path

df = pd.read_csv('sample.tsv', sep='\t', index_col='Sample')
SAMPLES = df.index
BAM_PATH = df["Path"]
# print (BAM_PATH['sample1'])

rule all:
    input:
        expand("{path}{sample}.summary.txt", zip, path=BAM_PATH, sample=SAMPLES)


rule indexBam:
    input:
        str( Path("{path}") / "{sample}.bam")
    output:
        str( Path("{path}") / "{sample}.bam.bai")
    shell:
        "samtools index {input}"

#this following command works as long as I give the specific folder for a sample instead of {path}.
rule bamdiagnostics:
    input:
        bam = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam"),
        bai = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
    params:
        prefix="analysis/BAMDiagnostics/{sample}"
    output:
        "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        "analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
        "analysis/BAMDiagnostics/{sample}_MQ.txt",
        "analysis/BAMDiagnostics/{sample}_readLength.txt",
        "analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
    message:
        "running BamDiagnostics...{wildcards.sample}"
    shell:
        ".atlas task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"

rule summary:
    input:
        bamd = "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        index = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
    output:
        str( Path("{path}") / "{sample}.summary.txt")
    shell:
        "echo -e '{input.index} {input.bamd}"
Trillian Astra
  • 129
  • 1
  • 10
Manavalan Gajapathy
  • 3,900
  • 2
  • 20
  • 43
  • Thank you for your answer. As mentioned in my question, the code "works" like this, but the output is a mixture of Sample1.bam being used for Sample1_approximateDepth.txt AND Sample2_approximateDepth.txt and vice versa. I do not believe this is intendet as I thought that the "expand" option is used to give *one* output for *several* inputs?? -- I included the DAG to your solution to clarify where the problem is :) – Trillian Astra Mar 20 '18 at 23:03
  • @EcstasiaTisiphoni So, are you trying to run `rule bamdiagnostics` separately for Sample1 and Sample2? If not, could you list the intended shell command(s) for this rule? – Manavalan Gajapathy Mar 21 '18 at 02:14
  • yes, exactley. running separately for each file. This tool calculates the coverage and the mapping-quality and read length distribution. Therefore I want to calculate it separately for each file. – Trillian Astra Mar 21 '18 at 09:35
  • Thanks a lot for your modified version!! – Trillian Astra Mar 21 '18 at 09:49
  • Glad to be of help. You may want to accept this as answer (tick mark) if it solved your question. – Manavalan Gajapathy Mar 21 '18 at 14:50
  • Couldn't test it yet but looks promising! Will-do then. – Trillian Astra Mar 21 '18 at 18:17
  • I tried it out and for me it gave the error >Input and output files have to be specified as strings or lists of strings< . Maybe a versioning issue? After putting str() around the input and output statements it worked! I editet it in your answer and maked it as solved. Thanks a lot!! – Trillian Astra Mar 22 '18 at 10:28
  • yes, `pathlib` support was [added recently](https://bitbucket.org/snakemake/snakemake/pull-requests/262/support-paths-from-pathlib-for-inputs-and/diff). – Manavalan Gajapathy Mar 22 '18 at 15:08