4

I am pretty new to using Snakemake and I have looked around on SO to see if there is a solution for the below - I am almost very close to a solution, but not there yet.

I have a single column file containing a list of SRA ids and I want to use snakemake to define my rules such that every SRA id from that file becomes a parameter on command line.

#FileName = Samples.txt
Samples
SRR5597645
SRR5597646
SRR5597647

Snakefile below:

from pathlib import Path
shell.executable("bash")
import pandas as pd
import os
import glob
import shutil

configfile: "config.json"

data_dir=os.getcwd()

units_table = pd.read_table("Samples.txt")
samples= list(units_table.Samples.unique())

#print(samples)

rule all:
    input:
           expand("out/{sample}.fastq.gz",sample=samples)

rule clean:
     shell: "rm -rf .snakemake/"

include: 'rules/download_sample.smk'

download_sample.smk

rule download_sample:
    """
    Download RNA-Seq data from SRA.
    """
    input: "{sample}"
    output: expand("out/{sample}.fastq.gz", sample=samples)
    params:
        outdir = "out",
        threads = 16
    priority:85
    shell: "parallel-fastq-dump --sra-id {input} --threads {params.threads} --outdir {params.outdir}  --gzip "

I have tried many different variants of the above code, but somewhere I am getting it wrong.

What I want: For every record in the file Samples.txt, I want the parallel-fastq-dump command to run. Since I have 3 records in Samples.txt, I would like these 3 commands to get executed

parallel-fastq-dump --sra-id SRR5597645 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597646 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597647 --threads 16 --outdir out --gzip

This is the error I get

snakemake -np
WildcardError in line 1 of rules/download_sample.smk:
Wildcards in input files cannot be determined from output files:
'sample'

Thanks in advance

user10101904
  • 427
  • 2
  • 12
  • 1
    The `input` section of a rule must refer to file (or patterns that can refer to files once wildcards have been substituted by the corresponding values). Snakemake needs this because it connects rules based on the files that they need (`input` section) and produce (`output` section). The SRA id cannot belong to the `input` section, because it is not a file. It could be in the parameter section, but it seems that you do not need this here, because the SRA id simply corresponds to the `{sample}` wildcard, which you can directly use in the `shell` section (see @dariober's answer). – bli Jun 11 '19 at 14:53
  • Thanks @dariober - Your solution worked – user10101904 Jun 12 '19 at 01:51

2 Answers2

3

It seems to me that what you need is to access the sample wildcard using the wildcards object:

rule all:
    input: expand("out/{sample}_fastq.gz", sample=samples)

rule download_sample:
    output: 
        "out/{sample}_fastq.gz"
    params:
        outdir = "out",
        threads = 16
    priority:85
    shell:"parallel-fastq-dump --sra-id {wildcards.sample} --threads {params.threads} --outdir {params.outdir}  --gzip "
dariober
  • 8,240
  • 3
  • 30
  • 47
1

The first solution could be to use the run: section of the rule instead of the shell:. This allows you to employ python code:

rule download_sample:
    # ...
    run:
        for input_file in input:
            shell(f"parallel-fastq-dump --sra-id {input_file} --threads {params.threads} --outdir {params.outdir} --gzip")

This straightforward solution however is not idiomatic. From what I can see, you have a one-to-one relationship between input samples and output files. In other words to produce one out/{sample}_fastq.gz file you need a single {sample}. The best solution would be to reduce your rule to the one that makes a single file:

rule download_sample:
    input: "{sample}"
    output: "out/{sample}_fastq.gz"
    params:
        outdir = "out",
        threads = 16
    priority:85
    shell: "parallel-fastq-dump --sra-id {input} --threads {params.threads} --outdir {params.outdir} --gzip "

The rule all: now requires all targets; the rule download_sample downloads a single sample, the Snakemake workflow does the rest: it constructs a graph of dependences and creates one instance of the rule download_sample per sample. Moreover, if you wish it can run these rules in parallel.

Dmitry Kuzminov
  • 6,180
  • 6
  • 18
  • 40
  • Thank you very much. Your solution makes sense, but I get an error. ```Wildcards in input files cannot be determined from output files:'sample'``` Also, to clarify, every line in my {input_file} is an individual command on the shell To clarify, if I have ```sample_a``` and ```sample_b``` as the 2 lines in my input_file ```Samples.txt```, then there will be 2 different shell commands: ```parallel-fastq-dump --sra-id sample_a .....``` ```parallel-fastq-dump --sra-id sample_b .....``` – user10101904 Jun 11 '19 at 11:22
  • I provided you 2 solutions, which one produces an error? Please update your question. – Dmitry Kuzminov Jun 11 '19 at 13:19
  • Solution 2 produces the error (I would need something similar to Solution 2 as I intend to run it in parallel). I also see something similar in this SO question which looks unanswered: https://stackoverflow.com/questions/51946235/snakemake-error-when-trying-to-generate-multiple-output-files – user10101904 Jun 11 '19 at 14:14
  • @user10101904, please show full example of the code that produces the error. – Dmitry Kuzminov Jun 11 '19 at 14:33
  • I have updated my original question with the code and errors. Thanks for helping me – user10101904 Jun 11 '19 at 15:00
  • @user10101904, please study my solution precisely. You don't need to expand the output. – Dmitry Kuzminov Jun 11 '19 at 15:20
  • Dmitry - sorry I misinterpreted and I expanded my input. Even your answer works but because of SO rules I think I can only mark 1 answer correct. Thanks Dmitry – user10101904 Jun 12 '19 at 01:54