With the help of previous Stack Overflow answers, I am using a samples dataframe to read in file information, including sample and batch, to process a list of sequence files. In rule all, I use expand with zip to create the list of target files; however, the dry run schedules jobs with unintended combinations of the sample and batch wildcards. How can I restrict the outputs to the combinations defined in rule all? (Is the issue with how I define the output of the trim_reads rule?)
Here is a section of the samples file, which supplies the sample and batch values:
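To check my understanding of the difference, here is a minimal pure-Python sketch of what I expect expand to return with and without zip. It does not call Snakemake itself: expand_zip and expand_product are hypothetical helpers that only imitate the pairing, and OTHER_SAMPLE is a made-up second sample.

```python
from itertools import product

# L5011 and both batch names come from my data; OTHER_SAMPLE is a made-up
# placeholder so that there are two rows to pair up.
samples = ["L5011", "OTHER_SAMPLE"]
batches = ["MT02_MTB_2021-10-29", "MT01_MtB_Baits-2021-09-17"]
pattern = "results/{batch}/{samp}/trim/{samp}_trim_1.fq.gz"


def expand_zip(pattern, samp, batch):
    """Pair the i-th sample with the i-th batch (what I expect from expand(..., zip, ...))."""
    return [pattern.format(samp=s, batch=b) for s, b in zip(samp, batch)]


def expand_product(pattern, samp, batch):
    """Combine every sample with every batch (expand's default, as I understand it)."""
    return [pattern.format(samp=s, batch=b) for s, b in product(samp, batch)]


zipped = expand_zip(pattern, samples, batches)
crossed = expand_product(pattern, samples, batches)
print(len(zipped), len(crossed))  # 2 4
```

If this picture is right, the path results/MT01_MtB_Baits-2021-09-17/L5011/... should appear only in the product version, which is why the job in my dry run surprises me.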
sample,fastq_1,fastq_2,batch,run
L5011,/labs/jandr/walter/tb/data/MT02_MTB_2021-10-29/L5011_S16_L001_R1_001.fastq.gz,/labs/jandr/walter/tb/data/MT02_MTB_2021-10-29/L5011_S16_L001_R2_001.fastq.gz,MT02_MTB_2021-10-29,MT02_MTB_2021-10-29
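One thing I can check is whether any sample name is listed under more than one batch. The sketch below uses only the standard library and a table reduced to the sample and batch columns. The second row is hypothetical (I have not confirmed that such a row exists in my file), and batches_per_sample is a helper name I made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Reduced to the sample and batch columns. The first data row is from my samples
# file; the second is HYPOTHETICAL and only shows what a duplicate would look like.
table = """sample,batch
L5011,MT02_MTB_2021-10-29
L5011,MT01_MtB_Baits-2021-09-17
"""


def batches_per_sample(handle):
    """Map each sample name to the set of batches it is listed under."""
    seen = defaultdict(set)
    for row in csv.DictReader(handle):
        seen[row["sample"]].add(row["batch"])
    return seen


seen = batches_per_sample(io.StringIO(table))
# Keep only the samples that appear under two or more batches.
duplicated = {s: sorted(b) for s, b in seen.items() if len(b) > 1}
print(duplicated)
```

If a sample name were repeated like this, the index created by set_index("sample") would no longer be unique, and samples_df.loc[wildcards.sample, "fastq_1"] would return a Series of paths instead of a single path.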
When I do a Snakemake dry run (snakemake -np), I get jobs like the following, with an unexpected combination of sample and batch (the batch is wrong for this sample):
rule trim_reads:
    input: data/MT02_MTB_2021-10-29/L5011_S16_L001_R1_001.fastq.gz, data/MT02_MTB_2021-10-29/L5011_S16_L001_R2_001.fastq.gz
    output: results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_1.fq.gz, results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_2.fq.gz
    log: results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_reads.log
    jobid: 6137
    wildcards: batch=MT01_MtB_Baits-2021-09-17, sample=L5011
mtb/workflow/scripts/trim_reads.sh data/MT02_MTB_2021-10-29/L5011_S16_L001_R1_001.fastq.gz data/MT02_MTB_2021-10-29/L5011_S16_L001_R2_001.fastq.gz results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_1.fq.gz results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_2.fq.gz &>> results/MT01_MtB_Baits-2021-09-17/L5011/trim/L5011_trim_reads.log
Thank you very much for your help! The relevant part of my Snakefile:
import pandas as pd

# Read the comma-separated samples table and index it by sample name.
samples_df = pd.read_table('config/MT01-04.tsv', sep=',').set_index("sample", drop=False)
sample_names = list(samples_df['sample'])
batch_names = list(samples_df['batch'])
#print(sample_names)

# fastq1 input function definition
def fq1_from_sample(wildcards):
    return samples_df.loc[wildcards.sample, "fastq_1"]

# fastq2 input function definition
def fq2_from_sample(wildcards):
    return samples_df.loc[wildcards.sample, "fastq_2"]

# Define a rule for running the complete pipeline.
rule all:
    input:
        trim = expand(['results/{batch}/{samp}/trim/{samp}_trim_1.fq.gz'], zip, samp=sample_names, batch=batch_names)

# Trim reads for quality.
rule trim_reads:
    input:
        p1=fq1_from_sample,
        p2=fq2_from_sample
    output:
        trim1=temp('results/{batch}/{sample}/trim/{sample}_trim_1.fq.gz'),
        trim2=temp('results/{batch}/{sample}/trim/{sample}_trim_2.fq.gz')
    log:
        'results/{batch}/{sample}/trim/{sample}_trim_reads.log'
    shell:
        '{config[scripts_dir]}trim_reads.sh {input.p1} {input.p2} {output.trim1} {output.trim2} &>> {log}'
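I notice that my input functions look up the FASTQ paths by sample only, so they return the same reads whatever the batch wildcard is. One idea I am considering is to key the lookup on both wildcards. The following is a minimal sketch outside Snakemake, not a confirmed fix: the fastq_1 dictionary and the function name fq1_from_sample_and_batch are hypothetical, and SimpleNamespace stands in for the wildcards object that Snakemake passes to input functions.

```python
from types import SimpleNamespace

# Hypothetical lookup table keyed on (sample, batch); in the Snakefile it would be
# built from samples_df. The path is the R1 input shown in my dry-run output.
fastq_1 = {
    ("L5011", "MT02_MTB_2021-10-29"):
        "data/MT02_MTB_2021-10-29/L5011_S16_L001_R1_001.fastq.gz",
}


def fq1_from_sample_and_batch(wildcards):
    """Return the R1 path only for a (sample, batch) pair listed in the table."""
    key = (wildcards.sample, wildcards.batch)
    if key not in fastq_1:
        raise ValueError(f"sample/batch combination not in samples file: {key}")
    return fastq_1[key]


# Stand-ins for the wildcards of an expected job and of the unexpected one.
good = SimpleNamespace(sample="L5011", batch="MT02_MTB_2021-10-29")
bad = SimpleNamespace(sample="L5011", batch="MT01_MtB_Baits-2021-09-17")

print(fq1_from_sample_and_batch(good))
try:
    fq1_from_sample_and_batch(bad)
except ValueError as err:
    print("rejected:", err)
```

I am not sure this addresses the root cause. It would only make an unexpected combination fail loudly instead of trimming another batch's reads into the wrong results directory.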