
With the help of previous StackOverflow answers, I am reading file information (including sample and batch) into a samples dataframe to process a list of sequence files. In my rule all, I use expand with zip to build the list of target files; however, I'm getting unintended combinations of the sample and batch wildcards in my output. Do you have suggestions for how to restrict the output to the combinations defined by rule all (i.e. is this an issue with how I define the output of the trim_reads rule)?

For example, here is a section of the samples file showing the sample and batch columns that feed the wildcards:

sample,fastq_1,fastq_2,batch,run
L5011,/labs/jandr/walter/tb/data/MT02_MTB_2021-10-29/L5011_S16_L001_R1_001.fastq.gz,/labs/jandr/walter/tb/data/MT02_MTB_2021-10-29/L5011_S16_L001_R2_001.fastq.gz,MT02_MTB_2021-10-29,MT02_MTB_2021-10-29

When I run a Snakemake dry run (snakemake -np), I get the following as an example of an unexpected sample/batch combination (the batch does not match the one listed for this sample in the samples file):

rule trim_reads:
    input: data/MT02_MTB_2021-10-29/L5011_S16_L001_R1_001.fastq.gz, data/MT02_MTB_2021-10-29/L5011_S16_L001_R2_001.fastq.gz
    output: results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_1.fq.gz, results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_2.fq.gz
    log: results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_reads.log
    jobid: 6137
    wildcards: batch=MT01_MtB_Baits-2021-09-17, sample=L5011

mtb/workflow/scripts/trim_reads.sh data/MT02_MTB_2021-10-29/L5011_S16_L001_R1_001.fastq.gz data/MT02_MTB_2021-10-29/L5011_S16_L001_R2_001.fastq.gz results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_1.fq.gz results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_2.fq.gz &>> results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_reads.log

Thank you very much for your help!

import pandas as pd

# Read the samples table (comma-separated) and index it by sample name.
samples_df = pd.read_table('config/MT01-04.tsv', sep=',').set_index("sample", drop=False)
sample_names = list(samples_df['sample'])
batch_names = list(samples_df['batch'])
#print(sample_names)

# fastq1 input function definition
def fq1_from_sample(wildcards):
  return samples_df.loc[wildcards.sample, "fastq_1"]

# fastq2 input function definition
def fq2_from_sample(wildcards):
  return samples_df.loc[wildcards.sample, "fastq_2"]

# Define a rule for running the complete pipeline. 
rule all:
  input:
    trim = expand(['results/{batch}/{samp}/trim/{samp}_trim_1.fq.gz'], zip, samp=sample_names,batch=batch_names)

# Trim reads for quality. 
rule trim_reads:  
  input: 
    p1=fq1_from_sample,
    p2=fq2_from_sample
  output:     
    trim1=temp('results/{batch}/{sample}/trim/{sample}_trim_1.fq.gz'),
    trim2=temp('results/{batch}/{sample}/trim/{sample}_trim_2.fq.gz')
  log: 
    'results/{batch}/{sample}/trim/{sample}_trim_reads.log'
  shell:
    '{config[scripts_dir]}trim_reads.sh {input.p1} {input.p2} {output.trim1} {output.trim2} &>> {log}'

ksw
  • Can you provide more details? The contents of `sample_names`, `batch_names`, and an example of an "unintended combination" would be a good start. – Troy Comi Mar 14 '22 at 17:53

2 Answers


The example you provided looks fine. Here's a minimal example showing it working:

# Snakefile
samples = ['1', '2']
batches = ['b1', 'b2']

print(expand(['{sample}/{batch}/{sample}'], zip, sample=samples, batch=batches))

$ snakemake -nq
['1/b1/1', '2/b2/2']
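
For comparison, here's the same toy example with zip left out: expand then falls back to the full cross product of the wildcard lists, which gives exactly the kind of mismatched sample/batch pairing you're describing (just a sketch, not something from your workflow):

# Snakefile -- same lists as above, but no zip combinator
print(expand(['{sample}/{batch}/{sample}'], sample=samples, batch=batches))
# -> all four sample x batch combinations, including mismatches like '1/b2/1'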

Try the following:

  • Change your rule all so you can see what the expand output is:
trim = expand(['results/{batch}/{samp}/trim/{samp}_trim_1.fq.gz'], zip, samp=sample_names,batch=batch_names)
print(trim)
rule all:
  input: trim
  • Check what the other rules are requesting. If you are running more than what you've posted, try snakemake --debug-dag to see which rules are requesting which outputs. Also snakemake --reason can be handy for checking individual rules.

Hope that helps, update if you are still stuck!

Troy Comi

Thanks to @Troy Comi for the troubleshooting help. After using snakemake --debug-dag to troubleshoot, I found that a follow-up rule wasn't using 'zip' appropriately in the expand function and was generating the unexpected target files downstream of the trim_reads rule.
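
As a rough sketch of the kind of fix involved (the rule name and paths below are made up for illustration, not the actual downstream rule), the downstream expand also needs the zip combinator; otherwise it requests every sample x batch pairing of trim_reads outputs:

# Hypothetical downstream rule. Without zip, this expand would request the
# full cross product of sample and batch, producing the mismatched targets:
#   expand('results/{batch}/{sample}/trim/{sample}_trim_1.fq.gz',
#          sample=sample_names, batch=batch_names)
# With zip, sample and batch stay paired, matching rule all:
rule collect_trimmed:
  input:
    expand('results/{batch}/{sample}/trim/{sample}_trim_1.fq.gz', zip,
           sample=sample_names, batch=batch_names)
  output:
    'results/trimmed_fastq_list.txt'
  shell:
    'ls {input} > {output}'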

Thank you again for the help troubleshooting this!

ksw