1

Is there a way to define a snakemake config string in a .yaml file so that it can contain {wildcard} and {param} values, and when that string is used in a shell command, the {<name>} values are substituted with the actual value of "<name>"?

For example, suppose you want a config string to define the format of a string to be passed as an argument to a program:

RG: "ID:{ID} REP:{REP}"

where the above is in a .yaml file, and ID and REP are wildcards, and a shell command will pass the expanded string as an argument to a program.

tedtoal

4 Answers

8

Let me try to provide a short answer to the question:

In Snakemake, you can provide functions to params, which take wildcards as an argument. In these functions, you can execute any Python code, including a call to str.format to fill in your config value, e.g.

configfile: "config.yaml"

rule:
    output:
        "plots/myplot.{mywildcard}.pdf"
    params:
        myparam=lambda wildcards: config["mykey"].format(**wildcards)
    shell:
        ...

As you can see, you can use the Python unpacking operator and the str.format method to fill in the placeholders in the config value. This assumes that config["mykey"] yields a string containing the same wildcard as above, e.g. "foo{mywildcard}bar".
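To see the substitution on its own, outside of Snakemake, here is a minimal sketch. It assumes a plain dict standing in for both `config` and the `wildcards` object (in a real workflow Snakemake supplies those), and the helper name `expand` is made up for illustration:

```python
# Stand-ins for illustration only: in a real workflow, `config` is loaded from
# the configfile and `wildcards` is passed by Snakemake to the params function.
config = {
    "mykey": "foo{mywildcard}bar",
    "RG": "ID:{ID} REP:{REP}",   # the format string from the question
}

def expand(key, wildcards):
    """Format config[key], passing each wildcard as a keyword argument."""
    return config[key].format(**wildcards)

print(expand("mykey", {"mywildcard": "A"}))       # fooAbar
print(expand("RG", {"ID": "2307", "REP": "12"}))  # ID:2307 REP:12
```

Because the wildcards are unpacked into keyword arguments, the placeholders in the config string are just the bare wildcard names.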

Johannes Köster
1

Yes, using a params lambda function:

MACBOOK> cat paramsArgs.yaml
A: "Hello world"
B: "Message: {config[A]}  ID: {wildcards.ID}   REP: {wildcards.REP}"

MACBOOK> cat paramsArgs
configfile: "paramsArgs.yaml"

rule all:
    input: "ID2307_REP12.txt"

def paramFunc(key, wildcards, config):
    return config[key].format(wildcards=wildcards, config=config)

rule:
    output: "ID{ID}_REP{REP}.txt"
    params: A=config["A"], B=lambda wildcards: paramFunc("B", wildcards, config)
    shell:
        """
        echo 'A is {params.A}' > {output}
        echo 'B is {params.B}' >> {output}
        """

MACBOOK> snakemake -s paramsArgs
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   2
    1   all
    2

rule 2:
    output: ID2307_REP12.txt
    jobid: 1
    wildcards: REP=12, ID=2307

Finished job 1.
1 of 2 steps (50%) done

localrule all:
    input: ID2307_REP12.txt
    jobid: 0

Finished job 0.
2 of 2 steps (100%) done

MACBOOK> cat ID2307_REP12.txt 
A is Hello world
B is Message: Hello world  ID: 2307   REP: 12
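The placeholders in B rely on two features of Python's format syntax: `{config[A]}` does an item lookup (the key is written without quotes) and `{wildcards.ID}` does an attribute lookup. Here is a small sketch of just that part, assuming a `types.SimpleNamespace` as a stand-in for the wildcards object that Snakemake would normally pass in:

```python
from types import SimpleNamespace

config = {
    "A": "Hello world",
    "B": "Message: {config[A]}  ID: {wildcards.ID}   REP: {wildcards.REP}",
}
# Hypothetical stand-in for Snakemake's wildcards object.
wildcards = SimpleNamespace(ID="2307", REP="12")

def param_func(key, wildcards, config):
    # Passing the objects by name lets the template use
    # {config[...]} (item lookup) and {wildcards....} (attribute lookup).
    return config[key].format(wildcards=wildcards, config=config)

result = param_func("B", wildcards, config)
print(result)  # Message: Hello world  ID: 2307   REP: 12
```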
tedtoal
0

Here's a param function that lets you expand values from several different snakemake sources in a config string:

def paramFunc(wildcards, input, output, threads, resources, config,
  global_cfg, this_cfg, S):

    return S.format(wildcards=wildcards, input=input, output=output,
        threads=threads, resources=resources, config=config,
        global_cfg=global_cfg, this_cfg=this_cfg)

Here's an example of how to call paramFunc() from within a Snakemake params: section, to expand the value of the config parameter config["XYZ"] and assign it to the parameter named "text", then expand that "text" parameter in a shell command:

   params:
       text=lambda wildcards, input, output, threads, resources:
           paramFunc(wildcards, input, output, threads, resources, config,
                global_cfg, my_local_cfg, config["XYZ"])
   shell: "echo 'text is {params.text}'"

Notice that the last argument to paramFunc() is the parameter value you want to expand, config["XYZ"] in this case. The other arguments are the objects (Snakemake's wildcards, input, output, threads, and resources, plus the config dictionaries) whose values that parameter value might reference.

You might have defined config["XYZ"] like this, for example, in a .yaml file:

ABC: "Hello world"
XYZ: "ABC is {config[ABC]}"

The string XYZ is not limited to expanding values defined in the same file, as ABC is here. You can use other "{}" constructs to access values defined elsewhere:

Defined in                               Use this construct in param
----------                               ---------------------------
"config" dictionary                      "{config[<name>]}"
wildcards used in the output filename    "{wildcards[<name>]}"
input filename(s)                        "{input}" or "{input[NAME]}" or "{input[#]}"
output filename(s)                       "{output}" or "{output[NAME]}" or "{output[#]}"
threads                                  "{threads}"
resources                                "{resources[<name>]}"
"global_cfg" global config dictionary    "{global_cfg[<name>]}"
"my_local_cfg" module config dictionary  "{this_cfg[<name>]}"
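Several of the constructs in the table can be tried in plain Python. This is only a sketch: the dicts and lists below are stand-ins for the objects Snakemake would pass to the params function, and the file names, the `sample` wildcard, and the `mem_mb` resource are invented for the example.

```python
def param_func(template, **sources):
    """Expand {name...} constructs in template from the named sources."""
    return template.format(**sources)

# Hypothetical stand-ins for the values Snakemake would supply.
sources = {
    "wildcards": {"sample": "S1"},
    "input": ["reads/S1.fastq"],
    "output": ["aligned/S1.bam"],
    "threads": 4,
    "resources": {"mem_mb": 8000},
    "config": {"ABC": "Hello world"},
}

template = ("ABC is {config[ABC]}; {wildcards[sample]}: {input[0]} -> "
            "{output[0]} using {threads} threads, {resources[mem_mb]} MB")

expanded = param_func(template, **sources)
print(expanded)
# ABC is Hello world; S1: reads/S1.fastq -> aligned/S1.bam using 4 threads, 8000 MB
```

Note that a numeric index such as `{input[0]}` is treated as an integer position, while any other text inside the brackets is used as a string key.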

The values "global_cfg" and "my_local_cfg" are two special dictionaries that could be added to assist with modularizing the snakefile.

For "global_cfg", the idea is that you might want to have a dictionary of snakefile-global definitions. In your main snakefile, do this:

include: "global_cfg.py"

And in file global_cfg.py, place global definitions:

global_cfg = {
    "DATA_DIR" : "ProjData",
    "PROJ_DESC" : "Mint Sequencing"
}

Then you can reference these values in parameter strings with e.g.:

"{global_cfg[DATA_DIR]}"

(the strings must be expanded in a params: section by calling paramFunc())
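One thing to watch for: the name inside the brackets has to match the dictionary key exactly, underscore included, or the expansion fails with a KeyError. A quick check in plain Python, using the global_cfg dictionary from above:

```python
global_cfg = {
    "DATA_DIR": "ProjData",
    "PROJ_DESC": "Mint Sequencing",
}

# The key is spelled exactly as in the dictionary, so this expands.
data_dir = "{global_cfg[DATA_DIR]}".format(global_cfg=global_cfg)
print(data_dir)  # ProjData

# A key that is not in the dictionary raises KeyError at expansion time.
try:
    "{global_cfg[DATADIR]}".format(global_cfg=global_cfg)
except KeyError as err:
    missing = err.args[0]
    print("unknown key:", missing)  # unknown key: DATADIR
```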

For "my_local_cfg", the idea is that you might want to place each snakefile rule in a separate file, and have the parameters for that rule also defined in a separate file, so each rule has a rule file and a parameter file. In the main snakefile:

(include paramFunc() definition above)
include: "myrule.snake"
rule all:
    input: "myrule.txt"

In myrule.snake:

include: "myrule.py"

In myrule.py place the config settings for the myrule module:

myrule_cfg = {
    "SPD" : 125,
    "DIST" : 98,
    "MSG" : "Param settings: Speed={this_cfg[SPD]}  Dist={this_cfg[DIST]}"
}

and back in myrule.snake:

include: "myrule.py"
rule myrule:
    params:
        SPD=myrule_cfg["SPD"],
        DIST=myrule_cfg["DIST"],
        # For MSG call paramFunc() to expand {name} constructs.
        MSG=lambda wildcards, input, output, threads, resources:
           paramFunc(wildcards, input, output, threads, resources, config,
               global_cfg, myrule_cfg, myrule_cfg["MSG"])
    message: "{params.MSG}"
    output: "myrule.txt"
    shell: "echo '-speed {params.SPD} -dist {params.DIST}' >{output}"

Note that the paramFunc() function maps the name "myrule_cfg" (varies from one rule to the next) to the fixed name "this_cfg" (same regardless of rule).
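That name mapping is just a keyword argument to format(). Stripped of the Snakemake parts, it can be sketched like this (the helper `expand_rule_value` and the second dictionary `otherrule_cfg` are made up for illustration):

```python
myrule_cfg = {
    "SPD": 125,
    "DIST": 98,
    "MSG": "Param settings: Speed={this_cfg[SPD]}  Dist={this_cfg[DIST]}",
}

# Hypothetical second rule, to show that the template syntax stays the same.
otherrule_cfg = {"SPD": 60, "MSG": "Speed={this_cfg[SPD]}"}

def expand_rule_value(rule_cfg, key):
    # Whatever the rule's dictionary is called, the template sees it as this_cfg.
    return rule_cfg[key].format(this_cfg=rule_cfg)

print(expand_rule_value(myrule_cfg, "MSG"))     # Param settings: Speed=125  Dist=98
print(expand_rule_value(otherrule_cfg, "MSG"))  # Speed=60
```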

Note that I include .py files that define the global_cfg and per-rule (this_cfg) dictionaries. These could instead be defined in .yaml files, but the problem is that they then all end up in one dictionary, "config". It would be nice if the configfile command allowed the dictionary to be specified, e.g.:

configfile: global_cfg="global_cfg.yaml"

Perhaps that feature will be added someday to snakemake.

tedtoal
0

I realized that adding **config and **globals() to the format() call in Johannes Köster's answer does two things: it allows expansion of variables defined in the Python code of the snakefile (such as the variable "ABC" in the following example), and it allows expansion of config parameters by name, without using "config" in the expansion. Suppose config.yaml contains:

X: "Hello"
MSG: "config X: {X}   variable ABC: {ABC}   wildcard WW: {WW}"

and you have this snakefile:

configfile: "config.yaml"

rule all:
    input: "test.Goodbye.txt"

rule A:
    output: "test.{WW}.txt"
    params: MSG=lambda wildcards: config["MSG"].format(**wildcards, **config, **globals())
    message: "{params.MSG}"
    shell: "echo '{params.MSG}' >{output}"


ABC = "This is the ABC variable"

The message and file output will be this line:

config X: Hello   variable ABC: This is the ABC variable   wildcard WW: Goodbye
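One caveat with unpacking several dictionaries into a single format() call: if the same name appears in more than one of them, Python raises a TypeError for the duplicate keyword. A possible workaround, sketched here in plain Python, is to merge the sources in a chosen order and use str.format_map. The dicts are stand-ins (`namespace` plays the role of globals(), `wildcards` the role of the Snakemake wildcards), and `expand` is a hypothetical helper:

```python
config = {
    "X": "Hello",
    "MSG": "config X: {X}   variable ABC: {ABC}   wildcard WW: {WW}",
}
namespace = {"ABC": "This is the ABC variable"}  # stand-in for globals()
wildcards = {"WW": "Goodbye"}                    # stand-in for Snakemake wildcards

def expand(template, *mappings):
    """Merge the mappings (later ones win on duplicate names), then format."""
    merged = {}
    for mapping in mappings:
        merged.update(mapping)
    return template.format_map(merged)

msg = expand(config["MSG"], namespace, config, wildcards)
print(msg)

# Unpacking two dicts that share a key into one call is rejected by Python.
try:
    "{X}".format(**{"X": 1}, **{"X": 2})
    duplicate_rejected = False
except TypeError:
    duplicate_rejected = True
```

With this ordering, a wildcard overrides a config key of the same name, which in turn overrides a snakefile variable; that precedence is my choice for the sketch, not something Snakemake defines.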
tedtoal