
I am very new to Snakemake and I am trying to create a merged.fastq for each sample. My Snakefile is below.

configfile: "config.yaml"
print(config['samples'])
print(config['ss_files'])
print(config['pass_files'])

rule all:
    input:
        expand("{sample}/data/genome_assembly/medaka/medaka.fasta", sample=config["samples"]),
        expand("{pass_file}", pass_file=config["pass_files"]),
        expand("{ss_file}", ss_file=config["ss_files"]) 

rule merge_fastq:
    input: 
        directory("{pass_file}")
    output: 
        "{sample}/data/merged.fastq.gz"
    wildcard_constraints:
        id="*.fastq.gz"
    shell:
        "cat {input}/{id} > {output}"   

where 'samples' is a list of sample names and 'pass_files' is a list of directory paths to the fastq_pass folders, each of which contains small fastq files.

I am trying to merge small fastq files to a large merged.fastq for each sample.

I am getting the following,

Wildcards in input files cannot be determined from output files: 'pass_file'

as the error.

vinay kusuma

1 Answer


Each wildcard in the input section must have a corresponding wildcard (with the same name) in the output section. That is how Snakemake works: when Snakemake constructs the DAG of jobs and finds that it needs a certain file, it looks at the output section of each rule and checks whether that rule can produce the required file. Matching the required filename against the output pattern is how Snakemake assigns concrete values to the wildcards. Every wildcard in the other sections must match one of the wildcards in the output, and that is how the input gets concrete filenames.
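To make the matching step concrete, here is a toy illustration in plain Python. It is not Snakemake's actual implementation; the helper match_output is hypothetical and only mimics the idea of deriving wildcard values from a requested output file.

```python
import re


def match_output(pattern, target):
    """Toy matcher: turn each {name} in the pattern into a named regex group
    and try to match the requested target file against it."""
    regex = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        # Literal text before the placeholder is matched verbatim.
        regex += re.escape(pattern[pos:m.start()]) + f"(?P<{m.group(1)}>.+)"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


# The requested file fixes the value of {sample} ...
wildcards = match_output("{sample}/data/merged.fastq.gz", "s1/data/merged.fastq.gz")
print(wildcards)  # {'sample': 's1'}

# ... but nothing in the output pattern can supply a value for {pass_file}.
try:
    "{pass_file}".format(**wildcards)
except KeyError as err:
    print("cannot resolve input wildcard:", err.args[0])
```

The second part fails for the same reason the rule in the question does: the input asks for a name that the output pattern never defines.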

Now let's look at your rule merge_fastq:

rule merge_fastq:
    input: 
        directory("{pass_file}")
    output: 
        "{sample}/data/merged.fastq.gz"
    wildcard_constraints:
        id="*.fastq.gz"
    shell:
        "cat {input}/{id} > {output}"   

The only wildcard that can get a value is {sample}. The {pass_file} and {id} wildcards appear nowhere in the output, so they are dangling.

As far as I can see, you are trying to merge files that are not known at design time. Take a look at dynamic files, checkpoints, and using a function as the input.
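A minimal sketch of the input-function approach, under two assumptions that the question does not confirm: that config["samples"] and config["pass_files"] are parallel lists in the same order, and that the small fastq files already exist when the workflow starts. The directory paths and the names PASS_DIR and fastq_parts are hypothetical.

```python
import glob
import os

# Hypothetical stand-in for the values loaded by `configfile: "config.yaml"`.
# Assumption: the two lists are parallel (same order, same length).
config = {
    "samples": ["sampleA", "sampleB"],
    "pass_files": ["/path/to/sampleA/fastq_pass", "/path/to/sampleB/fastq_pass"],
}

# Map each sample name to its fastq_pass directory.
PASS_DIR = dict(zip(config["samples"], config["pass_files"]))


def fastq_parts(wildcards):
    """Return the sorted *.fastq.gz files for the sample named in the output."""
    pass_dir = PASS_DIR[wildcards.sample]
    return sorted(glob.glob(os.path.join(pass_dir, "*.fastq.gz")))


# In the Snakefile, the function itself (unquoted) would be given as the input:
#
# rule merge_fastq:
#     input:
#         fastq_parts
#     output:
#         "{sample}/data/merged.fastq.gz"
#     shell:
#         "cat {input} > {output}"
```

With this shape the only wildcard is {sample}, which comes from the output; the directory is looked up from the sample name instead of being a second wildcard.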

The rest of your Snakefile is hard to understand. For example I don't see how you specify the files that match this pattern: "{sample}/data/merged.fastq.gz".

Update:

Lets say, I have a directory(/home/other_computer/jobs/data/<sample_name>/*.fastq.gz) which is my input and output is (/result/merged/<sample_name>/merged.fastq.gz). What I tried is having the first path as input: {"pass_files"} (this comes from my config file) and output : "result/merged/{sample}/merged.fastq.gz"

First, let's simplify the task a little and replace {pass_file} with a hardcoded path. You have 2 degrees of freedom: the <sample_name> and the unknown files in the /home/other_computer/jobs/data/<sample_name>/ folder. The <sample_name> is a good candidate for becoming a wildcard, as this name can be derived from the target file. The unknown number of *.fastq.gz files doesn't even require any Snakemake constructs, as it can be expressed with a shell glob.

# Inside a shell command, a wildcard is referenced as {wildcards.<name>}.
rule merge_fastq:
    output: 
        "/result/merged/{sample_name}/merged.fastq.gz"
    shell:
        "cat /home/other_computer/jobs/data/{wildcards.sample_name}/*.fastq.gz > {output}"
Dmitry Kuzminov
  • Thanks for replying. I am merging many small fastq files and copying the merged fastq from one location to another. I am not able to understand how to design this in Snakemake since, as you said, the wildcards in input and output should be the same. As the locations of my input and output are different, I am not able to come up with a solution that keeps them the same. – vinay kusuma Sep 07 '20 at 08:30
  • @vinaykusuma There is no problem in specifying different locations for input and output, as wildcards may be just a part of the filename. Anyway, so far you haven't described the pattern of the files you are merging. Your question cannot be answered without proper focus. – Dmitry Kuzminov Sep 07 '20 at 15:26
  • Let's say I have a directory (/home/other_computer/jobs/data/<sample_name>/*.fastq.gz) which is my input, and the output is (/result/merged/<sample_name>/merged.fastq.gz). What I tried is having the first path as input: {"pass_files"} (this comes from my config file) and output: "result/merged/{sample}/merged.fastq.gz". As you see, I need to use two different wildcards in input and output and I can't use the same one in both. I am stuck on this. Did I make it clear? – vinay kusuma Sep 07 '20 at 15:42