
I am trying to create a pipeline that takes a user-configured directory from config.yaml (where the user has downloaded a BaseSpace project directory of .fastq.gz files) and runs fastqc on the sequence files. I already have the downstream steps that merge the fastqs by lane and run fastqc on the merged files.

However, the wildcards are giving me problems when I run fastqc on the original BaseSpace files. This is the error I get when I run Snakemake:

Missing input files for rule all:
qc/fastqc_premerge/DEX-13_S9_L001_ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b_r1_fastqc.zip
qc/fastqc_premerge/BOMB-3-2-19D_S8_L002_ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e_r1_fastqc.zip
qc/fastqc_premerge/DEX-13_S9_L002_ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59_r1_fastqc.zip

Any suggestions would be greatly appreciated. Below is minimal code to reproduce this problem.

import glob

configfile: "config.yaml"

wildcard_constraints:
   bsdir = '\w+_L\d+_ds.\w+',
   lanenum = '\d+'

inputdirectory=config["directory"]
DIRECTORY, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{bsdir}/{sample}_L{lanenum}_R1_001.fastq.gz")
DIRECTORY, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{bsdir}/{sample}_L{lanenum}_R2_001.fastq.gz")


##### target rules #####
rule all:
    input:
       #expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
        expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', zip, sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)  # Changed to this following a commenter's suggestion; however, Snakemake still won't run


rule fastqc_premerge_r1:
    input:
        f"{config['directory']}/{{bsdir}}/{{sample}}_L{{lanenum}}_R1_001.fastq.gz"
    output:
        html="qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1.html",
        zip="qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params: ""
    log:
        "logs/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1.log"
    threads: 1
    wrapper:
        "v0.69.0/bio/fastqc"

Directory structure:

ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/DEX-13_S9_L001_R1_001.fastq.gz
ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/DEX-13_S9_L001_R2_001.fastq.gz
ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59/DEX-13_S9_L002_R1_001.fastq.gz
ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59/DEX-13_S9_L002_R2_001.fastq.gz
ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e/BOMB-3-2-19D_S8_L002_R1_001.fastq.gz
ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e/BOMB-3-2-19D_S8_L002_R2_001.fastq.gz

In the case above, I would like to run fastqc on all 6 input R1/R2 files and then, downstream, create merged files for DEX-13_S9 (which has two lanes to merge) and BOMB-3-2-19D_S8 (whose merged file is just a copy of its single lane). I would then create 4 fastqc reports on the resulting merged R1 and R2 files.
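
To make the intended merge concrete, here is a minimal plain-Python sketch (not part of the Snakefile) that groups the six files listed above by sample and read. The regular expression `name_re` and the variable names are assumptions made for illustration; they simply mirror the `{sample}_L{lanenum}_R1_001.fastq.gz` naming shown in the directory structure.

```python
import re
from collections import defaultdict

# The six input files from the directory structure above.
files = [
    "ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/DEX-13_S9_L001_R1_001.fastq.gz",
    "ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/DEX-13_S9_L001_R2_001.fastq.gz",
    "ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59/DEX-13_S9_L002_R1_001.fastq.gz",
    "ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59/DEX-13_S9_L002_R2_001.fastq.gz",
    "ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e/BOMB-3-2-19D_S8_L002_R1_001.fastq.gz",
    "ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e/BOMB-3-2-19D_S8_L002_R2_001.fastq.gz",
]

# Hypothetical parser: sample, lane, and read are taken from the file name only.
name_re = re.compile(r"(?P<sample>.+)_L(?P<lane>\d+)_(?P<read>R[12])_001\.fastq\.gz$")

# Map (sample, read) -> list of per-lane files that would be merged together.
merge_groups = defaultdict(list)
for path in files:
    file_name = path.rsplit("/", 1)[-1]
    m = name_re.match(file_name)
    merge_groups[(m.group("sample"), m.group("read"))].append(path)

for (sample, read), paths in sorted(merge_groups.items()):
    print(sample, read, len(paths), "lane file(s)")
```

This yields four groups (two samples times two reads), which matches the 4 post-merge fastqc reports described above.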

EDIT: I had to change the Snakefile as follows to get Snakemake to run:

inputdirectory=config["directory"]
PROJECTDIR, RANDOMINT, LANENUM1, BSSTRINGS, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{proj}-{randint}_L{lanenum1}_ds.{bsstring}/{sample}_L{lanenum}_R1_001.fastq.gz", followlinks=True)
PROJECTDIR, RANDOMINT, LANENUM1, BSSTRINGS, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{proj}-{randint}_L{lanenum1}_ds.{bsstring}/{sample}_L{lanenum}_R2_001.fastq.gz", followlinks=True)


##### target rules #####
rule all:
    input:
       "qc/multiqc_report_premerge.html"




rule fastqc_premerge_r1:
    input:
        f"{config['directory']}/{{proj}}-{{randint}}_L{{lanenum1}}_ds.{{bsstring}}/{{sample}}_L{{lanenum}}_R1_001.fastq.gz"
    output:
        html="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1.html",
        zip="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc
    params: ""
    log:
        "logs/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1.log"
    threads: 1
    wrapper:
        "v0.69.0/bio/fastqc"

rule fastqc_premerge_r2:
    input:
        f"{config['directory']}/{{proj}}-{{randint}}_L{{lanenum1}}_ds.{{bsstring}}/{{sample}}_L{{lanenum}}_R2_001.fastq.gz"
    output:
        html="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2.html",
        zip="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc
    params: ""
    log:
        "logs/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2.log"
    threads: 1
    wrapper:
        "v0.69.0/bio/fastqc"

rule multiqc_pre:
    input:
        expand("qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1_fastqc.zip", zip, sample=SAMPLES, lanenum=LANENUMS, proj=PROJECTDIR, randint=RANDOMINT, lanenum1=LANENUM1, bsstring=BSSTRINGS),
        expand("qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2_fastqc.zip", zip, sample=SAMPLES, lanenum=LANENUMS, proj=PROJECTDIR, randint=RANDOMINT, lanenum1=LANENUM1, bsstring=BSSTRINGS)
    output:
        "qc/multiqc_report_premerge.html"
    log:
        "logs/multiqc_premerge.log"
    wrapper:
        "0.62.0/bio/multiqc"

m00am
HansVG
  • I'm afraid you need to find this out yourself. Usually, Snakemake starts building the DAG immediately. If not, there is some Python code in your Snakefile that takes ages to complete. Try to debug a bit, add print statements, etc. The stack trace points to line 70 in your Snakefile. – Johannes Köster Feb 02 '21 at 20:34
  • Thank you Johannes, I got it working. I put the fixed Snakemake code at the bottom of my question. There were two key problems in my original code: 1) I needed to use zip in the expand call; 2) I needed to split the subdirectory wildcard bsdir into multiple wildcards so I could build the path correctly. – HansVG Feb 02 '21 at 22:00

2 Answers


In your rule all you have:

expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)

This generates every combination of SAMPLES, DIRECTORY, and LANENUMS. Is this what you want? I suspect not, since it implies that every sample is in every directory and on every lane. You probably want to pass zip to expand so that the lists are combined element-wise:

expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', zip, sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
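
As a rough illustration of the difference, here is a plain-Python analogue that uses `itertools.product` and the built-in `zip` in place of Snakemake's `expand`. The list contents are taken from the file listing in the question; their order is illustrative only, since the order returned by `glob_wildcards` depends on the filesystem.

```python
from itertools import product

# Wildcard values for the three R1 files, one entry per file.
SAMPLES = ["DEX-13_S9", "DEX-13_S9", "BOMB-3-2-19D_S8"]
LANENUMS = ["001", "002", "002"]
DIRECTORY = [
    "ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b",
    "ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59",
    "ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e",
]

template = "qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip"

# Every sample combined with every lane and every directory.
all_combinations = [
    template.format(sample=s, lanenum=l, bsdir=d)
    for s, l, d in product(SAMPLES, LANENUMS, DIRECTORY)
]

# Element-wise pairing: the i-th sample with the i-th lane and i-th directory.
paired = [
    template.format(sample=s, lanenum=l, bsdir=d)
    for s, l, d in zip(SAMPLES, LANENUMS, DIRECTORY)
]

print(len(all_combinations), "paths from the full product")
print(len(paired), "paths from the element-wise pairing")
```

Only the element-wise pairing produces paths that correspond to files that actually exist on disk; the full product also pairs, for example, DEX-13_S9 with the ngc1838-8 directory.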
dariober
  • Thank you for your suggestion. You are correct, I don't want all combinations and should be using the zip function. However, I am getting the same missing input error. I've updated my question above to include exactly my test case of files. – HansVG Feb 01 '21 at 17:41
  • Is there an easy way to print the name of my input file that is 'missing'? It seems like I should be constructing the filepath correctly, but it would be nice to find out. I tried adding -p to my command, but it doesn't print the path of the input files. – HansVG Feb 01 '21 at 18:06
  • Try running Snakemake (as a dry run) with one of the files reported as missing as the target on the command line. It will then tell you why it cannot use the fastqc rule to generate that file. – Johannes Köster Feb 02 '21 at 18:41
  • I tried running it with the target, but it fails with "MissingRuleException: No rule to produce ". However, doesn't the output: directive of rule `fastqc_premerge_r1` define the zip output file? I'm not sure why Snakemake doesn't know how to generate this fastqc.zip file. Full log here: https://pastebin.com/14s5RWtK – HansVG Feb 02 '21 at 18:56
  • Your bsdir wildcard constraint actually requires bsdir to contain the lane number, but the output file definition of your rule has a separate wildcard for the lane. – Johannes Köster Feb 02 '21 at 19:14
  • I am trying to restrict bsdir to be the full subfolder name. The format looks like ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/, which then holds the two R1/R2 files, e.g. DEX-13_S9_L001_R1_001.fastq.gz. I thought my glob_wildcards call would capture the full subdirectory (bsdir) and then parse the sample and lane number out of the filename: DIRECTORY, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{bsdir}/{sample}_L{lanenum}_R1_001.fastq.gz"). I believe I need to provide a lane wildcard so I can build the full path to these input files for `fastqc_premerge_r1`. – HansVG Feb 02 '21 at 19:36
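
One way to test a constraint outside Snakemake is with plain `re`. The sketch below assumes Snakemake matches a requested file by compiling the rule's output pattern into a regex with one named group per wildcard, using the constraint where one is given and `.+` otherwise. `pattern_to_regex` is a hypothetical, simplified stand-in written for this example, not Snakemake's own function. Under that assumption, the bsdir constraint from the question cannot match these directory names, because `\w` does not match the hyphen in `ngc1838-10`. The `relaxed` constraint is an assumed alternative for comparison and does not come from the thread.

```python
import re

def pattern_to_regex(pattern, constraints):
    """Hypothetical helper: turn '{name}' placeholders into named regex groups.

    A wildcard with a constraint uses it; any other wildcard defaults to '.+'.
    """
    parts, last = [], 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        parts.append(re.escape(pattern[last:m.start()]))
        name = m.group(1)
        parts.append(f"(?P<{name}>{constraints.get(name, '.+')})")
        last = m.end()
    parts.append(re.escape(pattern[last:]))
    return re.compile("".join(parts) + "$")

output = "qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip"
target = ("qc/fastqc_premerge/DEX-13_S9_L001_"
          "ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b_r1_fastqc.zip")

# Constraints exactly as in the question: \w cannot match the '-' in 'ngc1838-10'.
original = {"bsdir": r"\w+_L\d+_ds.\w+", "lanenum": r"\d+"}
# Assumed alternative (not from the thread): allow '-' and escape the literal dot.
relaxed = {"bsdir": r"[\w-]+_L\d+_ds\.\w+", "lanenum": r"\d+"}

no_match = pattern_to_regex(output, original).match(target)
match = pattern_to_regex(output, relaxed).match(target)

print("original constraint matches:", no_match is not None)
print("relaxed constraint wildcards:", match.groupdict())
```

If the first check fails in the same way inside Snakemake, no rule's output can match the requested file. That would be consistent with the MissingRuleException above, and with the version in the question's EDIT working once the hyphen became a literal part of the pattern.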

It's telling you which files are missing; that's what the lines under "Missing input files for rule all" are.

That said, to answer your original question: a dry run (add the flags -n -r to your snakemake command) should tell you the input and output files of each job Snakemake plans to run.

  • Thank you for your suggestions. Yes, those files listed are the inputs for rule all. However, I am trying to figure out why the rule `fastqc_premerge_r1` isn't running. That rule should be producing the missing input files to rule all. ``` Building DAG of jobs... MissingInputException in line 21 of /data/projects/nathan_ernster/my_basespace_qc_pipeline/Snakefile: Missing input files for rule all: qc/fastqc_premerge/DEX-13_S9_L001_ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b_r1_fastqc.zip ``` – HansVG Feb 01 '21 at 20:35
  • Try `snakemake -n -r -F`? As I understand it, that should work. – Brendan Kohrn Feb 01 '21 at 21:38
  • I tried both `snakemake --use-conda -n -r -F` and `snakemake --use-conda -n -R fastqc_premerge_r1` to try to run the fastqc_premerge_r1 rule, but I get the same error as above. – HansVG Feb 01 '21 at 22:23