0

I would like to combine two command lines as one single to avoid the intermediate files.

workdir: "/path/to/workdir/"

rule all:
    input: 
        "my.filtered.vcf.gz"

rule bedtools:
    input:
        invcf="/path/to/my.vcf.gz",
        bedgz="/path/to/my.bed.gz"
    output:
        outvcf="my.filtered.vcf.gz"
    shell:
        "/Tools/bedtools2/bin/bedtools intersect -a {input.invcf} -b {input.bedgz} -header -wa |"
        "/Tools/bcftools/bcftools annotate -c CHROM,FROM,TO,GENE -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">') > {output.outvcf}"

I am getting invalid syntax error. I would appreciate if you could explain how to combine multiple shell lines in snakemake.

user3224522
  • 1,119
  • 8
  • 19

2 Answers2

3

You probably get an invalid syntax because of the " you use in your shell here: Description="Gene name">. This closes your shell. You can either escape these quotes or use the """ syntax:

rule bedtools:
    input:
        invcf="/path/to/my.vcf.gz",
        bedgz="/path/to/my.bed.gz"
    output:
        outvcf="my.filtered.vcf.gz"
    shell:
        "/Tools/bedtools2/bin/bedtools intersect -a {input.invcf} -b {input.bedgz} -header -wa |"
        "/Tools/bcftools/bcftools annotate -c CHROM,FROM,TO,GENE -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description=\"Gene name\">') > {output.outvcf}"

or

rule bedtools:
    input:
        invcf="/path/to/my.vcf.gz",
        bedgz="/path/to/my.bed.gz"
    output:
        outvcf="my.filtered.vcf.gz"
    shell:
        """
        /Tools/bedtools2/bin/bedtools intersect -a {input.invcf} -b {input.bedgz} -header -wa | /Tools/bcftools/bcftools annotate -c CHROM,FROM,TO,GENE -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">') > {output.outvcf}
        """

Note that you can use multi line with """. Example without pipes:

shell:
    """
    bedtools .... {input} > tempFile 
    bcftools .... tempFile > tempFile2
    whatever .... tempFile2 > {output}
    """
Eric C.
  • 3,310
  • 2
  • 22
  • 29
  • thanks Eric, this works now, but it is giving me the intermediate output, not the final one....I guess somethins is missing :/ – user3224522 Feb 19 '20 at 09:50
  • Ok I added -a {input.bedgz} in the shell line: /Tools/bedtools2/bin/bedtools intersect -a {input.invcf} -b {input.bedgz} -header -wa | /Tools/bcftools/bcftools annotate -a {input.bedgz} -c CHROM,FROM,TO,GENE -h <(echo '##INFO=') > {output.outvcf} – user3224522 Feb 19 '20 at 09:57
2

Escaping the double quotes is the problem, but to add a little more on formatting and pipes.

I prefer the syntax of wrapping each line in " so that the lines can be spaced better:

rule bedtools:
    input:
        invcf="/path/to/my.vcf.gz",
        bedgz="/path/to/my.bed.gz"
    output:
        outvcf="my.filtered.vcf.gz"
    shell:
        "/Tools/bedtools2/bin/bedtools "
           "intersect "
           "-a {input.invcf} "
           "-b {input.bedgz} "
           "-header -wa "
        "| /Tools/bcftools/bcftools "
           "annotate "
           "-c CHROM,FROM,TO,GENE "
           "-h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description=\"Gene name\">') "
        "> {output.outvcf}"

I find that clearer to see each argument and easier to change by moving lines around. But note that the trailing space of each line is required and you have to use an explicit newline, \n, if you want a separate command. When the prompt is printed the output is nicely formated. With the """ syntax you have to escape each newline with \ at the end and spaces at the start of the line are retained in printing.

If you have lot's of pipe work to do, check out the pipe flag. You write your first step as a rule and snakemake produces a named pipe between the rules, submitting them as a group:

rule bedtools_intersect:
    input:
        invcf="/path/to/my.vcf.gz",
        bedgz="/path/to/my.bed.gz"
    output:
        outvcf=pipe("my.intersected.vcf.gz")
    shell:
        "/Tools/bedtools2/bin/bedtools "
           "intersect "
           "-a {input.invcf} "
           "-b {input.bedgz} "
           "-header -wa "
        "> {output.outvcf}"

rule bcftools_annotate:
    input:
        invcf="my.intersected.vcf.gz"
    output:
        outvcf="my.filtered.vcf.gz"
    shell:
        "/Tools/bcftools/bcftools "
           "annotate "
           "-c CHROM,FROM,TO,GENE "
           "-h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description=\"Gene name\">') "
           "{input.invcf} "
        "> {output.outvcf}"

The advantage is you can reuse each rule throughout your pipeline to intersect or annotate while avoiding temporary files.

Troy Comi
  • 1,579
  • 3
  • 12