
I am building a Snakemake pipeline with Python scripts.

Some of the Python scripts take a directory as input, while others take files inside those directories as input.

I would like to have some rules that take the directory as input and some that take the files as input. Is this possible?

Example of what I am doing showing only two rules:

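import glob

# Strip the "raw.csv" suffix so that expand() below does not produce "...raw.csvraw_processed.csv".
FILES = [f[:-len("raw.csv")] for f in glob.glob("data/*/*raw.csv")]
# Drop the trailing slash so that "{folders}/normalised.csv" contains no "//".
FOLDERS = [f.rstrip("/") for f in glob.glob("data/*/")]

rule targets:
  input:
    processed_csv = expand("{files}raw_processed.csv", files=FILES),
    normalised_csv = expand("{folders}/normalised.csv", folders=FOLDERS)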

rule process_raw_csv:
  input: 
    script = "process.py",
    csv = "{sample}raw.csv"
  output:
    processed_csv = "{sample}raw_processed.csv"
  shell:
    "python {input.script} -i {input.csv} -o {output.processed_csv}"

rule normalise_processed_csv:
  input:
    script = "normalise.py",
    # Not passed on the command line; normalise.py fetches these files from the folder itself.
    # An input function is used because every input wildcard must also appear in the output.
    processed_csv = lambda wc: [f[:-len(".csv")] + "_processed.csv" for f in glob.glob(wc.folders + "/*raw.csv")]
  params:
    folder = "{folders}"
  output:
    normalised_csv = "{folders}/normalised.csv"
  shell:
    "python {input.script} -i {params.folder}"



Some Python scripts (such as process.py) take every file they read or write as an explicit command-line argument. Others (such as normalise.py) take only the main directory as input; they fetch their inputs from inside it and write their outputs to it.
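To make the two kinds of targets concrete, here is a minimal sketch of how both target lists could be derived from one list of raw files. The helper name `build_targets` is hypothetical, and the sketch assumes every raw file name ends in `raw.csv` and that paths use forward slashes, as in the example above.

```python
from pathlib import PurePosixPath


def build_targets(raw_files):
    """Return (processed_files, normalised_files) for a list of raw CSV paths.

    Hypothetical helper: assumes each path ends in "raw.csv".
    """
    processed = []
    folders = []
    for path in raw_files:
        p = PurePosixPath(path)
        # File-level target: data/s1/x_raw.csv -> data/s1/x_raw_processed.csv
        stem = p.name[: -len(".csv")]
        processed.append(str(p.with_name(stem + "_processed.csv")))
        # Folder-level target: remember each parent folder once, in order.
        folder = str(p.parent)
        if folder not in folders:
            folders.append(folder)
    normalised = [f"{folder}/normalised.csv" for folder in folders]
    return processed, normalised
```

In a Snakefile, the two returned lists could be passed directly as the inputs of the `targets` rule, which would keep the per-file and per-folder targets consistent with each other.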

I am considering rewriting all the Python scripts so that they take the main directory as input, but I suspect there is a neater way to run both types of script in the same Snakemake pipeline.

Thank you very much in advance.

P.S. I have checked and this question is similar but not the same: Process multiple directories and all files within using snakemake

Ulises Rey

1 Answer


I would like to have some rules that take the directory as input and some that take the files as input. Is this possible?

I don't see anything special with this requirement... What about this?

rule one:
    output:
        d=directory('{sample}'),
        a='{sample}/a.txt',
        b='{sample}/b.txt',
    shell:
        r"""
        mkdir -p {output.d}
        touch {output.a}
        touch {output.b}
        """

rule use_dir:
    input:
        d='{sample}',
    output:
        out='outdir/{sample}.out',
    shell:
        r"""
        cat {input.d}/* > {output.out}
        """

rule use_files:
    input:
        a='{sample}/a.txt',
        b='{sample}/b.txt',
    output:
        out='outfiles/{sample}.out',
    shell:
        r"""
        cat {input.a} {input.b} > {output.out}
        """

Rule use_dir uses the content of directory {sample}, whatever it contains. Rule use_files uses specifically the files a.txt and b.txt from directory {sample}.

dariober
  • Hi, thank you for your answer. This did not work because the files are not called "a.txt" or "b.txt"; the files in each folder carry a different prefix, like this: sample1/the_sample1_a.txt, sample1/the_sample1_b.txt, sample2/the_good_sample2_a.txt, sample2/the_good_sample2_b.txt – Ulises Rey Mar 15 '23 at 12:51
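
When the file names differ from folder to folder, as in the comment above, one possible approach is a Snakemake input function that looks up the files of a folder by pattern instead of hard-coding their names. The sketch below is an assumption-laden illustration, not the answerer's method: `folder_inputs` is a hypothetical helper, and it only finds files that already exist on disk when Snakemake builds its job graph.

```python
import glob
import os


def folder_inputs(folder, pattern="*.txt"):
    """Return the sorted paths in `folder` that match `pattern`.

    Hypothetical helper. In a Snakefile it could be used as an input
    function, for example:
        input: files=lambda wildcards: folder_inputs(wildcards.sample)
    It only sees files present on disk at the time it is called.
    """
    return sorted(glob.glob(os.path.join(folder, pattern)))
```

Files that are produced by an upstream rule would not be found this way; their names would have to be derived from files that already exist, or handled with Snakemake checkpoints.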