Process multiple directories and all files within using snakemake

Question

I've got a directory with 10 sub-directories (dir01 to dir10) and a number of files in each of those (new files are added every day to the sub-directories).

I'm trying to write a snakemake file that will go through all of the sub-directories and all the files and process them (run my convert.exe executable to convert my .Stp files to .Xml). The processed files will be moved to a new directory but into sub-directories with the same names as before and the same file name.

So, as an example, the final job flow should look similar to this:

/data01/dir01/Sample1.Stp --> processed by convert.exe --> /data01/temp/dir01/Sample1.xml

I'd also like to divide this over the 12 CPUs I've got access to, running it in parallel.

I've just started using snakemake and have gone through a couple of tutorials, but I'm getting a little lost.

Here is what I have so far. It's not working, and I'm not even sure if this is the right way to go about it. This is also only the first part: just trying to loop through the directories and files (not trying to convert or run in parallel yet).

directories = glob_wildcards("/data01/{dir}")
files = glob_wildcards("/data01/{dir}/{file}")

rule all:
        input:
                expand("/data01/temp/{dir}/{file}.moved.Stp", dir=directories, file=files)

rule sort:
        input:
                "/data01/{dir}/{file}.Stp"
        output:
                "/data01/temp/{dir}/{file}.moved.Stp"
        shell:
                "..."

Any help about how to go about this would be greatly appreciated!

Thanks!

score 5 · Answer 1 · edited May 14 '21 at 12:32

Based on this FAQ entry, try this:

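# Glob directories and files in one go. The ".Stp" suffix keeps the extension
# out of {file}; the [^/]+ constraints stop later runs from matching temp/.
directories, files = glob_wildcards("data01/{dir,[^/]+}/{file,[^/]+}.Stp")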

rule all:
    input:
        expand("data01/temp/{dir}/{file}.moved.Stp",
               zip, dir=directories, file=files)

rule copy:
    input:
        "data01/{dir}/{file}.Stp"
    output:
        "data01/temp/{dir}/{file}.moved.Stp"
    shell:
        "cp {input} {output}"

Your glob_wildcards calls do not work as written: the result has to be unpacked. You would need

directories, = glob_wildcards("/data01/{dir}")
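
The trailing comma matters because glob_wildcards returns a named tuple with one list per wildcard, not a bare list. A minimal plain-Python sketch of that unpacking, using a hand-built namedtuple as a stand-in for the real return value (the directory names are made up for illustration):

```python
from collections import namedtuple

# Stand-in for what glob_wildcards returns: a named tuple with one list
# per wildcard in the pattern (here only "dir"). The values are made up.
Wildcards = namedtuple("Wildcards", ["dir"])
result = Wildcards(dir=["dir01", "dir02"])

# Without the trailing comma you keep the whole tuple ...
whole = result
print(whole)

# ... with it, the one-element tuple is unpacked into the plain list.
unpacked, = result
print(unpacked)
```

Accessing the field by name (result.dir here) gives the same list as the unpacking.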

But you really need to glob everything in one go, as in my example. By default, expand makes all combinations of the two input lists (every directory with every file). You could use that feature if you had exactly the same files in every directory. Passing zip instead combines the two lists element by element, so each directory is paired only with its own files.
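
To see the difference between the two modes without running Snakemake, here is a rough plain-Python emulation using itertools.product and zip. It is only a sketch of the pairing logic, not Snakemake's own implementation, and the directory and file names are hypothetical:

```python
from itertools import product

# Hypothetical paired lists, shaped like the result of a single glob over
# "data01/{dir}/{file}.Stp": entry i of each list describes the same file.
directories = ["dir01", "dir01", "dir02"]
files = ["Sample1", "Sample2", "Sample3"]

template = "data01/temp/{dir}/{file}.moved.Stp"

# All-combinations mode: every directory entry with every file entry.
all_combinations = [
    template.format(dir=d, file=f) for d, f in product(directories, files)
]

# zip mode: element-by-element pairing, one target per existing input file.
paired = [template.format(dir=d, file=f) for d, f in zip(directories, files)]

print(len(all_combinations))
print(paired)
```

In this made-up layout the all-combinations list contains targets such as data01/temp/dir02/Sample1.moved.Stp, for which no input file exists, while the zipped list contains exactly one target per input.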
