
I would like to get snakemake running a Python script with a specific conda environment via a SGE cluster.

On the cluster I have miniconda installed in my home directory. My home directory is mounted via NFS, so it is accessible to all cluster nodes.

Because miniconda is in my home directory, the conda command is not on the system PATH by default; i.e., to use conda I first need to add it to the PATH explicitly.

I have a conda environment specification as a YAML file, which could be used with the --use-conda option. Will this also work with the --cluster "qsub" option?

FWIW I also launch Snakemake from a conda environment (in fact the same environment I want the script to run in).
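
For concreteness, a minimal sketch of the kind of invocation I have in mind (the job count is arbitrary):

    snakemake --use-conda --cluster "qsub" --jobs 100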

Alistair Miles
  • Quick answer: yes, --use-conda will also trigger conda usage in the cluster jobs. For miniconda to work "easily" in the general case, it should be in PATH. – Johannes Köster Aug 25 '17 at 11:45

1 Answer


I have an existing Snakemake system running conda on an SGE cluster. It's delightful and very doable. I'll try to offer perspective and guidance.

The location of your miniconda, local or shared, may not matter. If you log in to access your cluster, you should be able to update your default environment variables upon logging in, and this will have a global effect. If possible, I highly suggest editing the default settings in your .bashrc to accomplish this. It will properly, and automatically, set up your conda path upon login.

One of the lines in my file, /home/tboyarski/.bashrc:

 export PATH=$HOME/share/usr/anaconda/4.3.0/bin:$PATH

EDIT 1: Good point made in the comments.

Personally, I consider it good practice to put everything under conda control; however, this may not be ideal for users who commonly require access to software not supported by conda. Typically, support issues have to do with old operating systems (e.g., CentOS 5 support was recently dropped). As suggested in the comment, manually exporting the PATH variable in a single terminal session may be preferable for users who do not work exclusively on pipelines, as it will not have a global effect.
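
For example, a per-session setup might look like the following sketch; the miniconda path and environment name here are assumptions, so adjust them to your install:

    # Put conda on PATH for this shell session only (no global effect)
    export PATH="$HOME/miniconda3/bin:$PATH"

    # Activate the environment you will launch Snakemake from
    # (conda 4.3-era syntax; newer conda uses "conda activate")
    source activate snakemake-env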

With that said, prior to Snakemake execution I recommend initializing the conda environment used by the majority, or the entirety, of your pipeline, as I do myself. I find this preferable because it lets conda create the environment directly, instead of having Snakemake ask conda to create it. I don't have the link for the web discussion, but I believe I read somewhere that individuals who relied solely on Snakemake to create their environments, rather than launching from a base environment, found that the environments were being stored in the .snakemake directory and that it was getting excessively large. Feel free to look for the post; the issue was addressed by the author, who reduced the load on the hidden folder. Still, I think it makes more sense to launch jobs from an existing Snakemake environment, which interacts with your head node and then passes the corresponding environment variables to its child nodes. I like a bit of hierarchy.
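
As a rough sketch of what I mean, assuming your spec file is named environment.yaml and using a hypothetical environment name:

    # Let conda build the environment once, up front, from the yaml spec
    conda env create -f environment.yaml -n pipeline-env

    # Launch Snakemake from inside that environment
    source activate pipeline-env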

With that said, you will likely need to pass the environment variables to your child nodes if you are running Snakemake from your head node's environment and letting Snakemake interact with the SGE job scheduler via qsub. I actually use the built-in DRMAA support, which I highly recommend. Both submission methods require me to provide the following arguments:

    -V     Available for qsub, qsh, qrsh with command and qalter.

           Specifies that all environment variables active within the
           qsub utility be exported to the context of the job.

Also...

    -S [[hostname]:]pathname,...
           Available for qsub, qsh and qalter.

           Specifies the interpreting shell for the job. pathname must
           be an executable file which interprets command-line options
           -c and -s as /bin/sh does.

To give you a better starting point: I also specify virtual memory and core counts, though this might be specific to my SGE system; I do not know.

-V -S /bin/bash -l h_vmem=10G -pe ncpus 1
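
Plugged into the actual submission command, that looks something like the following sketch (the job count is arbitrary):

    # Submission via qsub
    snakemake --use-conda --jobs 100 \
        --cluster "qsub -V -S /bin/bash -l h_vmem=10G -pe ncpus 1"

    # Or, using the DRMAA support I mentioned; note the argument
    # string passed to --drmaa must start with a leading space
    snakemake --use-conda --jobs 100 \
        --drmaa " -V -S /bin/bash -l h_vmem=10G -pe ncpus 1"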

I highly expect you'll require both arguments when submitting to the SGE cluster, as I do personally. I recommend putting your cluster submission variables in JSON format, in a separate file. The code snippet above can be found in this example of what I've done personally. I've organized it slightly differently than in the tutorial, but that's because I needed a bit more granularity.
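
A minimal sketch of such a file, with hypothetical resource keys:

    {
        "__default__": {
            "h_vmem": "10G",
            "ncpus": 1
        }
    }

If saved as cluster.json, it can be referenced with --cluster-config, and its values substituted into the submission string:

    snakemake --use-conda --jobs 100 --cluster-config cluster.json \
        --cluster "qsub -V -S /bin/bash -l h_vmem={cluster.h_vmem} -pe ncpus {cluster.ncpus}"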

Personally, I only use the --use-conda option when running a conda environment different from the one I used to launch and submit my Snakemake jobs. For example, my main conda environment runs Python 3, but if I need a tool that requires Python 2, then and only then will I have Snakemake launch that rule with the specific environment, so that the rule executes with a path corresponding to a Python 2 installation. This was of huge importance to my employer, as the existing system I was replacing struggled to seamlessly switch between Python 2 and 3; with conda and Snakemake, this is very easy.
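
A sketch of such a rule, with hypothetical file and environment names; the conda: directive points Snakemake at the Python 2 environment spec, and it is only honoured when Snakemake is run with --use-conda:

    rule run_py2_tool:
        input:
            "data/input.txt"
        output:
            "data/output.txt"
        conda:
            "envs/py2.yaml"
        shell:
            "python2 scripts/tool.py {input} > {output}"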

In principle, I think it is good practice to launch a base conda environment and to run Snakemake from there. It encourages the use of a single environment for the entire run. Keep it simple, right? Complicate things only when necessary, like when needing to run both Python 2 and Python 3 in the same pipeline. :)

TBoyarski
  • Very helpful, thanks. I don't like to use .bashrc to put conda on my path because I find it clobbers lots of system binaries, causing other tools to break. However, I guess if I manually put conda on the path in order to activate my Snakemake environment, then use the -V flag with qsub, there shouldn't be a need to have it in .bashrc? – Alistair Miles Aug 21 '17 at 22:25
  • As described, I think you shouldn't need to edit your .bashrc; however, I encourage you, if possible, to move towards having everything under conda control. I encourage this as a gold standard for research purposes, as it makes things package-able and distributable. For me, EVERYTHING I DO is pipeline related, so I edit my .bashrc for a global effect. A more multi-faceted user, relying on software not supported by conda, would likely use your suggested approach of setting the environment for a specific terminal session and subsequent pipeline execution. – TBoyarski Aug 22 '17 at 00:11