0

I want to connect the realigner process with the indel reallignement. This is the rules:

rule gatk_IndelRealigner:
    input:
        tumor="mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam",
        normal="mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam",
        id="mapped_reads/merged_samples/operation/{tumor}_{normal}.realign.intervals"

    output:
        "mapped_reads/merged_sample/CoClean/{tumor}.sorted.dup.reca.cleaned.bam",
        "mapped_reads/merged_sample/CoClean/{normal}.sorted.dup.reca.cleaned.bam",
    params:
        genome=config['reference']['genome_fasta'],
        mills= config['mills'],
        ph1_indels= config['know_phy'],
    log:
        "mapped_reads/merged_samples/logs/{tumor}.indel_realign_2.log"
    threads: 8
    shell:
        "gatk -T IndelRealigner -R {params.genome}  "
        "-nt {threads} "
        "-I {input.tumor} -I {input.normal}  -known {params.ph1_indels} -known {params.mills} -nWayOut  .cleaned.bam --maxReadsInMemory 500000 --noOriginalAligmentTags --targetIntervals {input.id} >& {log}  "

This is the error:

Not all output files of rule gatk_IndelRealigner contain the same wildcards.

I suppose I need to use also the {tumor}_{normal} but I can't use. Snakemake:

rule all:
      input:expand("mapped_reads/merged_samples/CoClean/{sample}.sorted.dup.reca.cleaned.bam",sample=config['samples']),
           expand("mapped_reads/merged_samples/operation/{sample[1][tumor]}_{sample[1][normal]}.realign.intervals", sample=read_table(config["conditions"], ",").iterrows())

config.yml

conditions: "conditions.csv"

conditions.csv

tumor,normal
411,412

Here you can see an example of the code (for testing purpose) gave the same error:

directory

$ tree  prova/
prova/
├── condition.csv
├── config.yaml
├── output
│   ├── ABC.bam
│   ├── pippa.bam
│   ├── Pippo.bam
│   ├── TimBorn.bam
│   ├── TimNorm.bam
│   ├── TimTum.bam
│   └── XYZ.bam
└── Snakefile

this is snakemake

$ cat prova/Snakefile 
from pandas import read_table

configfile: "config.yaml"

rule all:
    input:
         expand("{pathDIR}/{sample[1][tumor]}_{sample[1][normal]}.bam", pathDIR=config["pathDIR"], sample=read_table(config["sampleFILE"], " ").iterrows()),
         expand("CoClean/{sample[1][tumor]}.bam",  sample=read_table(config["sampleFILE"], " ").iterrows()),
         expand("CoClean/{sample[1][normal]}.bam", sample=read_table(config["sampleFILE"], " ").iterrows())

rule gatk_RealignerTargetCreator:
     input:
         "{pathGRTC}/{normal}.bam",
         "{pathGRTC}/{tumor}.bam",
     output:
         "{pathGRTC}/{tumor}_{normal}.bam"
 #    wildcard_constraints:
 #        tumor = '[^_|-|\/][0-9a-zA-Z]*',
 #        normal = '[^_|-|\/][0-9a-zA-Z]*'
     run:
         call('touch ' + str(wildcard.tumor) + '_' + str(wildcard.normal) + '.bam', shell=True)



rule gatk_IndelRealigner:
    input:
        t1="output/{tumor}.bam",
        n1="output/{normal}.bam",

    output:
        "CoClean/{tumor}.sorted.dup.reca.cleaned.bam",
        "CoClean/{normal}.sorted.dup.reca.cleaned.bam",
    log:
        "mapped_reads/merged_samples/logs/{tumor}.indel_realign_2.log"
    threads: 8
    shell:
        "gatk -T IndelRealigner -R {params.genome}  "
        "-nt {threads} -I {input.t1} -I {input.n1}  & {log}  "

conditions.csv

$ more condition.csv 
tumor normal
TimTum TimBorn
XYZ ABC
Pippo pippa

Thanks for any suggestion

mau_who
  • 315
  • 2
  • 13

1 Answers1

1

I'm not convinced you have to include two input files to the GATK IndelRealigner. Building from that assumption, you can alter the rule to become indifferent to the "type (tumor vs normal)" of file it is process. I read the specs here. Please, if I am wrong, stop reading and correct me.

rule gatk_IndelRealigner:
    input:
        inputBAM="output/{sampleGATKIR}.bam",

    output:
        "CoClean/{sampleGATKIR}.sorted.dup.reca.cleaned.bam",
    log:
        "mapped_reads/merged_samples/logs/{sampleGATKIR}.indel_realign_2.log"
    params:
        genome="**DONT FORGET TO ADD THIS""
    threads: 8
    shell:
        "gatk -T IndelRealigner -R {params.genome}  "
        "-nt {threads} -I {input.inputBAM}   & {log}  "

By changing the rule to be bam-type agnostic (made up word) you gain two advantages, and there is one main disadvantage.

Advantages:

  1. Now we only have a single wild-card

  2. We can run the alignment of each .bam file independently, which with a devoted CPU should hopefully make things faster.

Disadvantage:

  1. We are now likely putting two copies of the genome onto memory somewhere, since the threads are now being run as separate processes, no more memory sharing of the genome file. (In my previous position, hardware availability wasn't typically an issue, so I heavily am biased towards splitting everything up)

The reason I think that the GATK documentation has it setup to accept multiple 'bam' files is because if you are just using it as a 1-off call you want to list all the files at the same time. We are not needing that since we are automating the call process. We're indifferent to 1 call or 100 calls.

Community
  • 1
  • 1
TBoyarski
  • 441
  • 3
  • 9
  • thanks so much.. Unfortunately I need to use tumor and normal sample for indell calling. This is importan for me also for use in other rules where I need to use that samples. The procedure is a [Co-cleaning](https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/) .. – mau_who Sep 26 '17 at 20:25
  • 1
    I'm not sure I understand. You can use the tumor and normal sample else where for indel calling. But within IndelRe, even the shell call doesn't require you to distinguish between tumor and normal, as you provide "-I' for both arguments. So just just a generic wildcard in IndelRealigner, use your tumor and normal wildcards elsewhere. This will l work, Snakemake figures it out :) I still dont see "gatk_IndelRealigner" needing tumor-normal pairs. The wildcard names do not have to be the same for Snakemake to connect two rules together. (GATKIR --> tumor) and (GATKIR -->normal). – TBoyarski Sep 26 '17 at 21:04
  • Ok I will try butI don't understand how prepare GATKIR? However the possibility of distinguish tumor and normal is important for the oher seps (Mutect for example). So the principle on how to resolve remain – mau_who Sep 27 '17 at 05:56
  • Any given rule, it if needs to determine which is the tumor and which is the normal, should do so itself. Typically the naming conventions for tumor and normals are too random for there to be any pattern to follow. I'n my rules which require the assertion of tumor and normal status, I use lookups against my current input file list to determine it. It's a little obfuscated with everything going on, but [check out the function calll getSampleName here](https://github.com/tboyarski/BCCRC-Snakemake/blob/master/modules/vcfUtil/getVcfTable). See how generalizable the inputs are for the rule as well? – TBoyarski Sep 28 '17 at 15:38