
I'm new to Snakemake and have spent today converting my shell-script-based pipeline to it, running into a lot of syntax issues along the way. Most of my trouble is with collecting all the files in a particular directory and inferring output names from input names, since that is how I do it in shell scripts (with a for loop). In particular, I tried to use the expand function in the output section and it always gave me an error.

After checking some example Snakefiles, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used, and if so, why? What if I want to use a prefix defined in config.yaml as part of the output file name, and that prefix cannot be inferred from the input file names? How can I achieve that, just like what I did below for the log section, where {runid} is my prefix?

Second question, about syntax: I tried to pass a user-defined ID from the configuration file (config.yaml) into the log section, and it seems I have to use expand in the following form. Is there a better way of passing strings defined in config.yaml?

log:    
    expand("fastq/fastqc/{runid}_fastqc_log.txt",runid=config["run"])

where config.yaml contains:

run:
    "run123"

Third question: I initially tried the following two methods, but both gave me errors. Does that mean Python syntax is not followed inside the log section (and probably input and output too)?

log:
    "fastq/fastqc/"+config["run"]+"_fastqc_log.txt"

log:
    "fastq/fastqc/{config["run"]}_fastqc_log.txt"
olala
  • Hi olala, I feel your question doesn't really stand out from the text. Can you please highlight your question in one or two sentences at the end? – Vasif Nov 18 '16 at 02:15
  • @Vasif thanks for the comment. I just edited it. – olala Nov 18 '16 at 18:12

2 Answers


Here is an example of a small workflow:

# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]

rule all:
    input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)

rule peak_calling:
    input: control="{control}.sam", treatment="{treatment}.sam"
    output: "{treatment}_vs_{control}.bed"
    shell: "touch {output}"

rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    shell: "cp {input} {output}"

I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
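
To see which target files that expand call asks for, here is a rough pure-Python stand-in. expand_sketch is a hypothetical helper written for this illustration, not Snakemake's own implementation; it assumes the default behaviour of combining every value of one wildcard with every value of the others.

```python
from itertools import product

def expand_sketch(pattern, **wildcards):
    """Hypothetical stand-in for snakemake's expand(): fill the pattern
    with every combination of the given wildcard values."""
    names = list(wildcards)
    # Accept a single string as well as a list of values.
    value_lists = [[v] if isinstance(v, str) else list(v)
                   for v in wildcards.values()]
    return [pattern.format(**dict(zip(names, combo)))
            for combo in product(*value_lists)]

TREATMENT = ["sample2"]
CONTROL = ["sample1"]

# Same pattern as the input of "rule all" above.
targets = expand_sketch("{treatment}_vs_{control}.bed",
                        treatment=TREATMENT, control=CONTROL)
print(targets)
```

The result is a plain list of concrete file names, which is why it belongs in the final target: the other rules keep their wildcards and are matched against these names.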

As for the last part, the right way to put it would be the first one:

log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"

But again, snakemake can deduce it from your target (the rule all, in my example).

rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    log: "{samples}.log"
    shell: "cp {input} {output}"
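
The deduction step can be pictured with a simplified Python sketch. infer_wildcards is a hypothetical helper for this illustration only; it assumes each wildcard matches one or more arbitrary characters and ignores everything else Snakemake does (constraints, ambiguity checks, and so on).

```python
import re

def infer_wildcards(output_pattern, target):
    """Simplified illustration: turn "{name}" placeholders into named
    regex groups and match the requested file name against the result."""
    regex = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", output_pattern):
        # Literal text between placeholders must match exactly.
        regex += re.escape(output_pattern[pos:m.start()])
        regex += f"(?P<{m.group(1)}>.+)"
        pos = m.end()
    regex += re.escape(output_pattern[pos:])
    match = re.fullmatch(regex, target)
    return match.groupdict() if match else None

# A requested file "sample1.sam" is matched against the output of rule mapping.
wildcards = infer_wildcards("{samples}.sam", "sample1.sam")
print(wildcards)

# The same values then fill in the input and log patterns of that rule.
print("{samples}.fastq".format(**wildcards))
print("{samples}.log".format(**wildcards))
```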

Hope this helps!

rioualen
  • I also had problems in my first attempts at using snakemake because of using expand in the wrong places. Using expand only for the final results file and letting snakemake infer the rest seems indeed to be the right thing to do. – bli Nov 18 '16 at 13:46
  • Thanks for the point! I can get this to work if I know the exact input file names and they stay the same through the whole pipeline, so Snakemake can just infer from them. But I have programs that generate a random name, and I want to pass that random name from config.yaml to the output as a prefix. That gave me a lot of headaches; maybe it's because I'm used to being able to pass any parameter into a shell script. – olala Nov 18 '16 at 18:15
  • @olala Could this be of any help: https://bitbucket.org/snakemake/snakemake/wiki/FAQ#markdown-header-how-do-i-run-my-rule-on-all-files-of-a-certain-directory ? (I mean the `glob_wildcards` thing, which could be used to infer the random part of the filename.) – bli Nov 21 '16 at 12:28
  • @olala if the prefix comes from the config file, you can do something like this: `expand("{prefix}/path/to/output.{{name}}.txt", prefix=config["prefix"])`. You use expand to format the path, but mask the real wildcard (here `{name}`), so that expand does not complain that no value is given for it. – Johannes Köster Nov 22 '16 at 08:44
  • @JohannesKöster, thanks for the answer here as well, I'm having a related question regarding the input for rule all, i.e., the target files. If I use expand double quotes in the output of a rule, how should I write the input for rule all? It seems to me that I can't use double quotes there. – olala Nov 29 '16 at 23:51
  • Not sure what you mean. expand works always the same, regardless of the context. – Johannes Köster Dec 04 '16 at 10:01

You can use f-strings:

If this is your folder_with_configs/some_config.yaml:

var: value

Then simply

configfile:
    "folder_with_configs/some_config.yaml"

rule one_to_rule_all:
    output:
        f"results/{config['var']}.file"
    shell:
        "touch {output[0]}"

Remember Python's rules for nesting different types of quotes, e.g. use single quotes for the dictionary key inside a double-quoted f-string. In a Snakemake rule, config is a plain Python dictionary.

If the path needs additional variables that should stay as placeholders, e.g. some_param, double the curly brackets:

rule one_to_rule_all:
    output:
        f"results/{config['var']}.{{some_param}}"
    shell:
        "touch {output[0]}"
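
The effect of the doubled brackets can be checked in plain Python, outside of Snakemake. In this small sketch the config dictionary is filled in by hand as a stand-in for what would be loaded from the YAML file above.

```python
# Stand-in for the dictionary built from some_config.yaml (assumed content).
config = {"var": "value"}

# Single braces are evaluated by Python right away; doubled braces
# come out as literal single braces.
pattern = f"results/{config['var']}.{{some_param}}"
print(pattern)  # results/value.{some_param}

# The leftover placeholder can still be filled in later, here with str.format.
print(pattern.format(some_param="txt"))  # results/value.txt
```

So the string the rule ends up with already contains the config value, with {some_param} left in place as a placeholder.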

enjoy

MatteoLacki