
I'm starting to experiment with using containers with Snakemake, and I have a question about what needs to be pre-built into the container and what doesn't. For example:

I want to run a Python script (stored in workflow_root/scripts/myScript.py, for example) in a container, with its input piped in from another program. Do I need to build the Python script into the container or declare it as an input file, or is it already accessible from within the container (and if so, how do I point to it)? My current rule looks something like:

rule myRule:
    params:
        sample = get_sample,
        basePath = sys.path[0]
    input:
        in1=get_in1,
        in2=get_in2
    output:
        out1 = "{runPath}/{sample}_read1_dcs.fq.gz",
        out2 = "{runPath}/{sample}_read2_dcs.fq.gz"
    priority: 50
    conda:
        "envs/myEnv.yaml"
    log:
        "{runPath}/logs/{sample}_myRule.log"
    shell:
        """
        set -e
        set -o pipefail
        set -x
        {{
        picard FastqToSam \
        F1={input.in1} \
        F2={input.in2} \
        O=/dev/stdout \
        SM={params.sample} \
        TMP_DIR=picardTempDir \
        SORT_ORDER=unsorted \
        | python3 {params.basePath}/scripts/myScript.py \
        --input /dev/stdin \
        --prefix {wildcards.sample}
        }} 2>&1 | tee -a {log}
        """

I also want to run bwa with a sizable user-provided reference. Can I do this, or would I need to build that reference into the container? (I'd also like to use Ensembl VEP, which has its own sizable reference database to deal with.)

I suppose what my question boils down to is: what files / locations are mounted to the container by Snakemake, and where do I find them when I'm writing rules involving shell commands? The documentation doesn't seem to be very clear on this, and it would be nice to be able to figure it out without having to do a bunch of experimentation.

  • It might be worth mentioning the `--containerize` option, which will create a Dockerfile for the pipeline, and then you can see what Snakemake is programmed to expect in a container. – merv Oct 08 '21 at 21:06

2 Answers


I will share how I use Snakemake with Singularity and conda. This setup has worked very well for me for over a year now. It may or may not suit your purposes, so feel free to ask questions.

  • The Snakemake workflow, along with the necessary scripts, configs, and docs, lives in its own git repo. No data is stored there.
  • The Singularity container is defined globally via the singularity: directive in the workflow. I don't build the container manually; Snakemake does that.
  • Conda environments are defined per rule via the conda: directive. In complex projects this can mean more than 15-20 separate conda environments. Snakemake builds these environments inside the Singularity container.
  • Data are kept separate from the source code. Instead, the workflow config file (typically specified via the configfile: directive) contains their paths.

An example Snakefile would look like this:

# config for the workflow
configfile: "configs/workflow_configs.yaml"

# singularity image to use
singularity: "docker://continuumio/miniconda3:4.7.12"

rule all:
    input:
        .....

rule some_job:
    input:
        .....
    output:
        .....
    conda: 
        "path/to/conda_env.yaml"
    shell:
        "....."

This setup works as long as all the tools needed for the workflow are available via conda. It is also possible to override the global Singularity container for a specific rule, either with a different container or with no container at all.

Also, I don't build my own Singularity containers; I just use the generic ones available from Docker Hub. I don't have the privileges to build my own containers on the HPC systems I work with anyway, so this setup removes the hassle of building an image somewhere else and then moving it to the HPC environment.

Manavalan Gajapathy
  • How would you define the location for the data in the config file? Is it a particular keyword name? – Brendan Kohrn Oct 13 '21 at 23:42
  • No specific keyword is needed. If using [`configfile:`](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#standard-configuration), the file needs to be in `json` or `yaml` format, and what goes into it is up to you. It could be as simple as defining the paths directly, as I have done [here](https://github.com/ManavalanG/personal_genome_analysis/blob/master/configs/example_configs.yaml), or as (relatively) complex as what is implemented in the [dna-seq-gatk-variant-calling pipeline](https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/blob/main/config/config.yaml). – Manavalan Gajapathy Oct 14 '21 at 02:57

Everything under the working directory is available and accessible as expected. Things referenced through absolute paths, however, will require mounting.

So, if your script is under scripts/my_script.py, that will be available as is. But, if you use, for example, a temporary directory like /scratch/user42, then you will need to mount that with something like

snakemake --singularity-args '-B /scratch/user42:/scratch/user42'
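When several locations are involved, the bind string can be put together programmatically. The following is only a minimal sketch: `needs_bind` and `build_bind_args` are hypothetical helper names, the working directory `/home/user/project` is made up, and the "inside the working directory" check is purely textual (symlinks are not resolved). It follows the rule of thumb above: paths under the working directory are skipped, everything else gets a `-B src:dest` entry.

```python
import posixpath
from pathlib import PurePosixPath


def _normalize(path, workdir):
    # Relative paths are interpreted relative to the working directory.
    # Purely textual normalization: symlinks are NOT resolved.
    return PurePosixPath(posixpath.normpath(PurePosixPath(workdir) / path))


def needs_bind(path, workdir):
    """Return True if `path` lies outside `workdir` (hypothetical helper)."""
    full = _normalize(path, workdir)
    wd = _normalize(workdir, workdir)
    return not (full == wd or wd in full.parents)


def build_bind_args(paths, workdir):
    """Build a '-B src:dest' string for every path outside `workdir`."""
    specs = []
    for path in paths:
        if needs_bind(path, workdir):
            full = str(_normalize(path, workdir))
            spec = f"{full}:{full}"
            if spec not in specs:  # avoid duplicate bind entries
                specs.append(spec)
    return " ".join(f"-B {spec}" for spec in specs)


# Hypothetical example: one path inside the project, one scratch directory.
args = build_bind_args(
    ["scripts/my_script.py", "/scratch/user42"],
    workdir="/home/user/project",
)
print(args)  # -B /scratch/user42:/scratch/user42
```

The printed string is what would be passed to `--singularity-args`.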

Personally, I specify my mounts in the config.yaml for my cluster profile, so I don't ever have to bother with the extra CLI arguments. For example, something like

~/.config/snakemake/[profile_name]/config.yaml

singularity-args: '-B /scratch/user42:/scratch/user42'

will then include this setting whenever --profile [profile_name] is used.

merv
  • So, is it possible to choose what locations are mounted within the Snakefile, such as from locations specified in a CSV file for each set of inputs? Edit: I currently use a csv config file that has the parameters and such for each set of input files. This includes things like what reference genome to use, what bed file to use, etc., and those may be in separate locations elsewhere on the computer. – Brendan Kohrn Oct 11 '21 at 15:49
  • @BrendanKohrn I only know about using it with `--profile`. I don't believe Snakemake respects arguments put in a normal `config.yaml` that is used with `configfile:` in the Snakefile. But yes, with a profile, one can either set it at the profile level, or include a local file (in my case `lsf.yaml`) that provides additional settings. – merv Oct 11 '21 at 16:33
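On the CSV question raised in the comments: one possible workaround, outside Snakemake itself, is to derive the bind list from the CSV before launching the workflow and pass the result to `--singularity-args` (or paste it into the profile). The sketch below is an assumption-laden illustration, not something Snakemake does for you: the column names `reference` and `bed`, the sample rows, and the helper `bind_dirs_from_csv` are all hypothetical.

```python
import csv
import io
import posixpath


def bind_dirs_from_csv(handle, path_columns):
    """Collect the unique parent directories of files named in the given columns."""
    dirs = []
    for row in csv.DictReader(handle):
        for column in path_columns:
            value = (row.get(column) or "").strip()
            if not value:
                continue  # empty cell: nothing to mount for this sample
            parent = posixpath.dirname(posixpath.normpath(value))
            if parent not in dirs:
                dirs.append(parent)
    return dirs


# Hypothetical per-sample config; a real run would open the CSV file instead.
example = io.StringIO(
    "sample,reference,bed\n"
    "s1,/refs/hg38/genome.fa,/targets/panel.bed\n"
    "s2,/refs/hg38/genome.fa,\n"
)

dirs = bind_dirs_from_csv(example, ["reference", "bed"])
singularity_args = " ".join(f"-B {d}:{d}" for d in dirs)
print(singularity_args)  # -B /refs/hg38:/refs/hg38 -B /targets:/targets
```

Mounting the parent directory rather than the single file keeps index files that sit next to a reference (as bwa expects) visible inside the container.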