
I'm using Snakemake for the first time to build a basic pipeline with cutadapt, bwa and GATK (trimming; mapping; calling). I would like to run this pipeline on every fastq file contained in a directory, without having to list the file names in the Snakefile or in the config file.

The first two steps (trimming with cutadapt and mapping with bwa) run fine, but I'm encountering problems with GATK.

First, I have to generate g.vcf files from bam files. I'm doing this using these rules:

configfile: "config.yaml"

import os
import glob

rule all:
    input:
        "merge_calling.g.vcf"

rule cutadapt:
    input:
        read="data/Raw_reads/{sample}_R1_{run}.fastq.gz", 
        read2="data/Raw_reads/{sample}_R2_{run}.fastq.gz" 
    output:
        R1=temp("trimmed_reads/{sample}_R1_{run}.fastq.gz"),
        R2=temp("trimmed_reads/{sample}_R2_{run}.fastq.gz") 
    threads:
        10
    shell:
        "cutadapt -q {config[Cutadapt][Quality_value]} -m {config[Cutadapt][min_length]} -a {config[Cutadapt][forward_adapter]} -A  {config[Cutadapt][reverse_adapter]} -o {output.R1} -p '{output.R2}' {input.read} {input.read2}"

rule bwa_map:
    input:
        genome="data/genome.fasta",
        read=expand("trimmed_reads/{{sample}}_{pair}_{{run}}.fastq.gz", pair=["R1", "R2"]) 
    output:
        temp("mapped_bam/{sample}_{run}.bam")
    threads:
        10
    params:
        rg="@RG\\tID:{sample}\\tPL:ILLUMINA\\tSM:{sample}"
    shell:
        "bwa mem -t 2 -R '{params.rg}' {input.genome} {input.read} | samtools view -Sb - > {output}"

rule picard_sort:
    input:
        "mapped_bam/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "java -Xmx4g -jar /home/alexandre/picard-tools/picard.jar SortSam I={input} O={output} SO=coordinate VALIDATION_STRINGENCY=SILENT"

rule picard_rmdup:
    input:
        bam="sorted_reads/{sample}.bam"
    output:
        reads="rmduped_reads/{sample}.bam",
        stats="picard_stats/{sample}.bam"
    shell:
        "java -Xmx2g -jar /home/alexandre/picard-tools/picard.jar MarkDuplicates "
        "I={input.bam} "
        "O={output.reads} "
        "VALIDATION_STRINGENCY=SILENT "
        "MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 "
        "REMOVE_DUPLICATES=TRUE "
        "M={output.stats}"

rule samtools_index:
    input:
        "rmduped_reads/{sample}.bam"
    output:
        "rmduped_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

rule GATK_raw_calling:
    input:
        bam="rmduped_reads/{sample}.bam",
        bai="rmduped_reads/{sample}.bam.bai",
        genome="data/genome.fasta"
    output:
        "Raw_calling/{sample}.g.vcf",
    shell:
        "java -Xmx4g -jar /home/alexandre/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -ploidy 2 --emitRefConfidence GVCF -T HaplotypeCaller -R {input.genome} -I {input.bam} --genotyping_mode DISCOVERY -o {output}"

These rules work fine. For example, if I have the files Cla001d_S281_L001_R1_001.fastq.gz and Cla001d_S281_L001_R2_001.fastq.gz,

I can create one bam file (Cla001d_S281_L001_001.bam) and, from that bam file, one GVCF file (Cla001d_S281_L001_001.g.vcf). I have many samples like this one, and I need to create one GVCF file for each, then merge all these GVCF files into a single one. The problem is that I can't find how to pass the list of files to merge to the following rule:

rule GATK_merge:
    input:
        ???
    output:
        "merge_calling.g.vcf"
    shell:
        "java -Xmx4g -jar /home/alexandre/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar "
        "-T CombineGVCFs "
        "-R data/genome.fasta "
        "--variant {input} "
        "-o {output}"

I tried several things in order to do that, but could not succeed. The problem is the link between the two rules (GATK_raw_calling and GATK_merge, the latter being supposed to merge the outputs of the former). If I specify the output pattern of GATK_raw_calling as the input of GATK_merge, Snakemake fails with "Wildcards in input files cannot be determined from output files", since the single output merge_calling.g.vcf contains no wildcards from which {sample} could be inferred; and if I don't specify those files as input, there is no link between the two rules at all...

Is there a way to do that? The difficulty, I think, is that I never define a list of sample names anywhere.

Thank you in advance for your help.

Alexandre.S
  • If your workflow is not too long, maybe you could post it entirely. Also, you may be interested in the `glob_wildcards` function mentioned in this FAQ entry: https://snakemake.readthedocs.io/en/stable/project_info/faq.html#glob-wildcards ("How do I run my rule on all files of a certain directory?") – bli Jul 05 '17 at 14:10
  • The output of `bwa_map` and the input of `picard_sort` do not have the same form. Maybe this works, but it may be confusing because the `{sample}` wildcards will take different forms in the two rules. But I suspect there will be a problem because the `{run}` wildcard cannot be inferred, or something like that. – bli Jul 06 '17 at 09:35
  • These two rules work, but it is true that this {run} wildcard is confusing. I'm trying to improve this with a regex that is supposed to match my fastq files at the beginning. I saw this in a tutorial: SAMPLES, = glob_wildcards(join(FASTQ_DIR, '{sample,Samp[^/]+}.R1.fastq.gz')) and I'm now trying to adapt this kind of command for my workflow. The idea is to end up with files named only by the sample name (for example: Cla001d.bam). – Alexandre.S Jul 06 '17 at 09:48
  • At what step do you "merge" runs of a same sample ? Also, the input of `all` is different from the output of `GATK_merge` in what you showed. – bli Jul 06 '17 at 09:56
  • My "all" rule was not the right one, I'll change that. I merge the R1 and R2 files at the first step of the mapping : rule bwa_map – Alexandre.S Jul 06 '17 at 10:10
  • Yes, but R1 and R2 are not the same thing as `run` in, for instance `data/Raw_reads/{sample}_R1_{run}.fastq.gz`. If this `run` part is expected to always be "001", then it will be simpler not to use a wildcard here. I don't know if this is guaranteed, though. – bli Jul 06 '17 at 10:13
  • Yet another comment: the value of `threads` will not have any effect in your shell commands if you don't use it explicitly (for instance, as a parameter for a program that can use multiple cores; see the sketch after these comments). Otherwise, the only effect will be that snakemake will not run rules in parallel if the sum of their `threads` values exceeds the value given at its `-j` option. This can be a trick to indirectly limit the use of other resources, like RAM: if you know that some rules are high RAM consumers, you may set a high `threads` value for them in order not to have several instances of such rules running. – bli Jul 06 '17 at 10:19
  • The {run} wildcard will probably be the same 99% of the time, you are totally right. The threads statement was only there because it was shown in the tutorial; I should have removed it. Thanks for the advice! I have steps that are more RAM-consuming than others, so this trick will be useful! – Alexandre.S Jul 06 '17 at 10:21
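
For illustration, here is what using the threads value explicitly would look like in the question's bwa_map rule, with the value passed to bwa mem through its -t option (a sketch; the rule is otherwise unchanged):

rule bwa_map:
    input:
        genome="data/genome.fasta",
        read=expand("trimmed_reads/{{sample}}_{pair}_{{run}}.fastq.gz", pair=["R1", "R2"])
    output:
        temp("mapped_bam/{sample}_{run}.bam")
    threads:
        10
    params:
        rg="@RG\\tID:{sample}\\tPL:ILLUMINA\\tSM:{sample}"
    shell:
        # {threads} now drives bwa instead of a hard-coded -t 2
        "bwa mem -t {threads} -R '{params.rg}' {input.genome} {input.read} | "
        "samtools view -Sb - > {output}"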

2 Answers


You can try to generate a list of sample IDs using glob_wildcards on the initial fastq.gz files:

sample_ids, run_ids = glob_wildcards("data/Raw_reads/{sample}_R1_{run}.fastq.gz")
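
For instance, with the example file pair from the question, this would yield (an illustration, assuming those are the only files in data/Raw_reads/):

# sample_ids == ["Cla001d_S281_L001"]
# run_ids    == ["001"]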

Then, you can use this to expand the input of GATK_merge:

rule GATK_merge:
    input:
        expand("Raw_calling/{sample}_{run}.g.vcf",
               sample=sample_ids, run=run_ids)

If a given run ID always comes with the same sample ID, you will need to zip instead of expanding, in order to avoid non-existing combinations:

rule GATK_merge:
    input:
        ["Raw_calling/{sample}_{run}.g.vcf".format(
            sample=sample_id,
            run=run_id) for sample_id, run_id in zip(sample_ids, run_ids)]
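
Note that expand also accepts zip as a combinator in place of the default product (supported in recent Snakemake versions), which achieves the same thing without an explicit list comprehension:

rule GATK_merge:
    input:
        expand("Raw_calling/{sample}_{run}.g.vcf", zip,
               sample=sample_ids, run=run_ids)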
bli
  • that does not work for me. Did the very same thing: `barcodes, = glob_wildcards('/path/to/fastq_pass/{barcode}')`, then on my aggregate rule I have `input: expand("nanopolish/{barcode}/{barcode}.consensus.fasta", barcode=barcodes)` but I am getting `AttributeError: 'Wildcards' object has no attribute 'barcode'` – BCArg Jun 04 '21 at 08:48
  • @BCArg Which wildcards exist depends on the output file patterns. Maybe you can open a separate question with a more detailed description of your problem. – bli Jun 09 '21 at 06:54

You can achieve this by using a Python function as an input for your rule, as described in the Snakemake documentation (see the section on functions as input files).

It could look like this, for example:

# Define input files
def gatk_inputs(wildcards):
    files = expand("Raw_calling/{sample}.g.vcf", sample=<samples list>)
    return files

# Rule
rule gatk:
    input: gatk_inputs
    output: <output file name>
    run: ...
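
For completeness, here is a minimal end-to-end sketch of this approach, combining the input function with glob_wildcards from the other answer. It assumes the question's directory layout (data/Raw_reads/{sample}_R1_{run}.fastq.gz); the params lambda and the repeated --variant flags are assumptions to verify against your Snakemake and GATK versions:

# Derive (sample, run) pairs from the raw fastq names; no hand-written list needed.
SAMPLES, RUNS = glob_wildcards("data/Raw_reads/{sample}_R1_{run}.fastq.gz")

# Input function: one GVCF per (sample, run) pair, zipped to avoid
# non-existing combinations.
def gatk_inputs(wildcards):
    return expand("Raw_calling/{sample}_{run}.g.vcf", zip,
                  sample=SAMPLES, run=RUNS)

rule GATK_merge:
    input:
        gatk_inputs
    output:
        "merge_calling.g.vcf"
    params:
        # GATK 3 expects one --variant flag per GVCF file (assumption:
        # check the CombineGVCFs documentation for your version).
        variants=lambda wildcards, input: " ".join("--variant " + f for f in input)
    shell:
        "java -Xmx4g -jar /home/alexandre/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar "
        "-T CombineGVCFs "
        "-R data/genome.fasta "
        "{params.variants} "
        "-o {output}"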

Hope this helps.

rioualen
  • Thank you for this answer. The problem is that I have no list of files, so the "samples=" part will not work. The only input for this workflow is a directory containing the fastq files. I also tried to use os.listdir("Raw_calling/") in the function, or as an input for my rule, but snakemake told me that the directory "Raw_calling" does not exist. I think it is because the python code is executed before the workflow runs, so the directory has not been created yet. Or maybe because it can't connect the merging rule with the previous one? – Alexandre.S Jul 05 '17 at 08:57
  • {sample} should be the same as in your rule GATK_raw_calling: a list of files, identifiers or whatever else you define somewhere. If rule GATK_raw_calling is working, then this should not be an issue... – rioualen Jul 05 '17 at 10:11
  • @Alexandre.S You say that you don't have a list of files, but I suppose you somehow have a list of sample IDs that you use in the input of a top rule in order to drive the execution of the `GATK_raw_calling` rule. It would help if you could post more of your workflow. Ideally a working example that includes the rules that work up to `GATK_raw_calling`. – bli Jul 05 '17 at 14:25
  • Hello. The difficulty is that I'm trying to create a pipeline that works with only one input: the directory of fastq files. That means I have no list of IDs, only a directory. The idea was to make this pipeline usable on most of the data I have to analyse in less than one minute: just put the fastq files in the right directory and run. I will post something more detailed, thanks for your answer! – Alexandre.S Jul 05 '17 at 14:51
  • Any update? I'm still trying to build a pipeline with the same goal: Put input files in a specific folder and run the pipeline. – fakechek Jan 03 '18 at 13:42