
Say I'm following the best-practice workflow suggested for Snakemake. Now I'd like to know how (i.e. with which software versions) a given file, say plots/myplot.pdf, was generated. I found this surprisingly hard, if not impossible, with only the result folder at hand.

In more detail, say I generated the results using snakemake --use-conda --conda-prefix ~/.conda/myenvs, which will resolve and download the conda environments specified in the rule below (copied from the documentation):

rule NAME:
    input:
        "table.txt"
    output:
        "plots/myplot.pdf"
    conda:
        "envs/ggplot.yaml"
    script:
        "scripts/plot-stuff.R"

Say the content of envs/ggplot.yaml is the following:

channels:
  - conda-forge
dependencies:
  - r-ggplot2

After completion, the ggplot environment will have been saved under, say, ~/.conda/myenvs/d2d1d57b (note that the env name d2d1d57b is assigned automatically by Snakemake).

The problem is that if I ship the workflow subfolder, e.g. as the result to someone else (or as a supplement to a paper), I don't know which ggplot version was used for that run. All I know is the content of the yaml file (which is also reported when using --report). Also, since ggplot depends on other software, for instance R, I wouldn't know which R version was used for a given rule using this environment, since the yaml file doesn't list indirect dependencies.

Ideally, I'd like to have the complete list of environment software versions shipped with the workflow results. As a workaround, one could use conda env export -n name_of_env and copy the output into the result folder, but strangely conda list -n ~/.conda/myenvs/d2d1d57b does not work (it fails with the error Characters not allowed: ('/', ' ', ':', '#')).

Creating an environment manually and inspecting it indeed gives me (among other info):

r-base                    4.0.2                he766273_1    conda-forge
r-ggplot2                 3.3.2             r40h6115d3f_0    conda-forge

That's exactly what I'm after, but doing this manually would of course be too tedious.

As far as I can tell, the same is true when using wrappers.

In summary: given a workflow, or even a given file within the workflow, how can I trace back which exact software version(s) were used to generate it? Ideally, this information would be shipped automatically with the results of a workflow by default.

Maybe I'm even missing something very obvious, so hopefully someone can shed some light on this.

Update: issue was submitted

  • Maybe the easiest solution is to just pin the version in the conda environment. – Maarten-vd-Sande Sep 24 '20 at 13:28
  • Sounds like you are looking for the prefix flag `--prefix|-p`: e.g., `conda list -p ~/.conda/myenvs/d2d1d57b`. Though I think `conda env export -p ~/.conda/myenvs/d2d1d57b` would be preferable. This is what I've done in exactly the same situation. – merv Sep 25 '20 at 03:56
  • I didn't know about the `-p` option, that's really useful and solves at least one problem. You are right, `env export` is what I actually had in mind, I'll change it in the question. Now the question would be to have those environments automatically included along with the results? – Sebastian Müller Sep 25 '20 at 05:08
  • As for @Maarten-vd-Sande's pinning suggestion: I'm not sure how to do that, but ideally the version info should be shipped with the results, since sometimes it's not possible to find the original workflow etc. – Sebastian Müller Sep 25 '20 at 05:10
  • @SebastianMüller Maybe another option (if you don't want to pin everything and want to know the complete environment) would be to do something like `conda env export > {log}` inside the shell script? – Maarten-vd-Sande Sep 25 '20 at 06:47
  • That's a really nice idea! I've just tried it out and it does seem to work. Do you want to write this up as an answer so I can accept it? Otherwise I'll do it, no problem! The only problem I anticipate is that this will fail when not using `--use-conda`, so it probably needs some sort of checking mechanism. – Sebastian Müller Sep 25 '20 at 12:25

2 Answers

Based on our discussion in the comments, you could redirect your environment to a log file:

rule NAME:
    input:
        "table.txt"
    output:
        "plots/myplot.pdf"
    log:
        "mylog.txt"
    conda:
        "envs/ggplot.yaml"
    shell:
        """
        conda env export > {log} 
        yourcode
        """

However, as you indicate, this won't work if people do not use --use-conda, and it is tedious to add this to each rule, so you could try something like this (not tested, might not work):

if workflow.use_conda:
    shell.prefix("set -o pipefail; conda env export > {log}; ")

Which adds the export to each shell command!

Now if you use scripts, I am not so sure how to continue. The "easiest" option might be to just call "conda env export" as a shell command from inside the Python/R script.
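
For the script case, here is a minimal sketch in Python, not tested inside an actual Snakemake run. It shells out to `conda env export`, which exports the currently active environment when no name or prefix is given, and writes the output to a file. The helper name `export_env` and its `command` parameter are made up for illustration. It assumes `conda` is on the PATH of the activated environment, and it returns None instead of raising when the command is missing or fails (e.g. when running without --use-conda).

```python
import subprocess
from pathlib import Path


def export_env(log_path, command=("conda", "env", "export")):
    """Run `command` and write its stdout to `log_path`.

    Returns the exported text, or None if the command is not available
    or exits with a non-zero status (hypothetical helper, for illustration).
    """
    try:
        result = subprocess.run(
            list(command), capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        # conda is not installed / not on PATH, or the export failed.
        return None
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(result.stdout)
    return result.stdout
```

Inside a `script:` file one would presumably call it as `export_env(snakemake.log[0])`, assuming the rule declares a log file.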

edit

The shell.prefix trick does not seem to work (the {log} wildcard cannot be resolved there), so I struck through that text.

Maarten-vd-Sande
  • Given that there is no canned answer to that question, I think this is a good enough workaround for the time being. I have yet to test the shell.prefix chunk and run some edge cases before accepting. I might still file an issue; I feel this sort of thing could be a valuable addition toward reproducibility. – Sebastian Müller Sep 25 '20 at 17:58
  • I've accepted the answer, since this is the best achievable at the moment, though far from optimal. The `shell.prefix` would be a nice solution, but it fails since it can't resolve the wildcard `{log}`, which makes sense since various other wildcards might feed in for each rule. The only alternative I got to work is appending to a fixed file with `>> logs/envs.log; ` instead, which is far from ideal but at least it works. Maybe you can change your answer accordingly? – Sebastian Müller Oct 13 '20 at 08:17
  • @SebastianMüller thanks for coming back on this. Too bad that does not work.. I updated the answer – Maarten-vd-Sande Oct 13 '20 at 08:28
  • Yes, too bad, but I've just filed an issue to address this: https://github.com/snakemake/snakemake/issues/685 . Thanks again! – Sebastian Müller Oct 13 '20 at 09:45

As @Maarten-vd-Sande mentioned, versions should be specified in the conda env file. As you may have guessed, you will also need to list r-base and its version in the conda env file to ensure that a specific version of R is used. See here for an example from a snakemake-wrapper.

As part of best practices for reproducible research, it is highly recommended to specify tool versions in conda env files. Snakemake-wrappers typically follow this rule, but you might find some that do not.

Manavalan Gajapathy
  • Well, I've already thought about specifying versions in the yaml files, which is why I mentioned it. r-base is just an example, but if I don't know all the dependencies of a given package, I'd have to do a lot of research to come up with a complete list including all (minor) versions. Conda already does this; I would just like to have those resolved environments shipped with the results. Also, how do I easily (i.e. without using any external webpages) find out the actual software version(s) when using wrappers? – Sebastian Müller Sep 24 '20 at 19:05
  • Apparently I missed that. Perhaps create the environment first and then export the env definition to a file? This can be achieved using `conda env export --name ENVNAME > envname.yml`. As for your wrapper question, I typically go to their GitHub page or the wrapper's website to find out the tools and versions used in the env file. If snakemake already pulled it as part of the workflow, you can find them at `.snakemake/conda/.yaml`. – Manavalan Gajapathy Sep 24 '20 at 19:25
  • That's a good point, but this would add a lot of overhead, and I'd like to stick with the recommended workflow approach (where this is not mentioned). Also, sometimes you get results from some other workflow, and it would be nice to have the info about the resolved environments. Unfortunately, your code `conda env export` doesn't work with the automatic conda envs, similar to `conda list`, i.e. `conda env export > ~/.conda/myenvs/d2d1d57b` throws an error. I take it that this is not an easy task then and I didn't miss anything obvious; maybe I'll submit an issue on GitHub? – Sebastian Müller Sep 25 '20 at 04:59
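
For reference, the `-p` flag suggested in the comments under the question can be scripted so that every environment under the `--conda-prefix` folder is exported next to the results. The following Python sketch rests on a few assumptions: the names `find_env_dirs` and `export_envs` and the `run` parameter are hypothetical, a sub-directory is treated as a conda environment if it contains a `conda-meta` folder, and `conda` is on the PATH. It has not been tested against a real Snakemake run.

```python
import subprocess
from pathlib import Path


def find_env_dirs(conda_prefix):
    """Return sub-directories of `conda_prefix` that contain `conda-meta`."""
    root = Path(conda_prefix).expanduser()
    return sorted(d for d in root.iterdir() if (d / "conda-meta").is_dir())


def export_envs(conda_prefix, out_dir, run=subprocess.run):
    """Write `conda env export -p ENV` for each env to `out_dir/<env name>.yaml`.

    `run` defaults to subprocess.run and can be swapped out for testing.
    Returns the list of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for env_dir in find_env_dirs(conda_prefix):
        result = run(
            ["conda", "env", "export", "-p", str(env_dir)],
            capture_output=True, text=True, check=True,
        )
        target = out / f"{env_dir.name}.yaml"
        target.write_text(result.stdout)
        written.append(target)
    return written
```

With the paths from the question, a call such as `export_envs("~/.conda/myenvs", "results/envs")` would be expected to produce one file per environment, e.g. `results/envs/d2d1d57b.yaml`; the `results/envs` folder is just an example location. Note that `out_dir` itself is not tilde-expanded.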