I have a dictionary with keys as patient IDs and a list of fastq files as values.

patient_samples = {
  "patientA": ["sample1", "sample2", "sample3"],
  "patientB": ["sample1", "sample4", "sample5", "sample6"]
}

I want to align each sample.fastq and output the aligned .bam file in a directory for each patient. The resulting directory structure I want is this:

├── patientA
│   ├── sample1.bam
│   ├── sample2.bam
│   ├── sample3.bam
├── patientB
│   ├── sample1.bam
│   ├── sample4.bam
│   ├── sample5.bam
│   ├── sample6.bam

Here I used lambda wildcards to get the samples for each patient using the "patient_samples" dictionary.

rule align:
    input:
        lambda wildcards: [
            "{0}.fastq".format(sample_id)
            for sample_id in patient_samples[wildcards.patient_id]
        ]
    output:
        "{patient_id}/{sample_id}.bam"
    shell:
        ### Alignment command

How can I write the rule all to reflect that only certain samples are aligned for each patient? I have tried referencing the dictionary key to specify the samples:

rule all:
    input:
        expand("{patient_id}/{sample_id}.bam", patient_id=patient_samples.keys(), sample_id=patient_samples[patient_id])

However, this leads to a NameError: name 'patient_id' is not defined

Is there another way to do this?

SultanOrazbayev
Snakemake users should keep in mind that rule inputs are just Python lists of filenames, which can be constructed using whatever programming technique is most convenient (see this answer for an example: https://stackoverflow.com/a/73684012/1878788). `expand` is just one Snakemake-provided Python function for generating lists, and using it is not always the most convenient solution. – bli Sep 13 '22 at 12:27

1 Answer

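The error occurs because Python evaluates patient_samples[patient_id] before expand is even called, and at that point no variable named patient_id exists. The patient_id keyword argument is not visible to the other arguments: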

expand(
   "{patient_id}/{sample_id}.bam",
   patient_id=patient_samples.keys(),
   sample_id=patient_samples[patient_id])
                                ^^^^^ Unknown
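
To see that the failure happens before expand runs at all, here is a minimal plain-Python sketch. Note that `fake_expand` is a hypothetical stand-in introduced only for illustration, not Snakemake's function; the dictionary is the one from the question.

```python
patient_samples = {
    "patientA": ["sample1", "sample2", "sample3"],
    "patientB": ["sample1", "sample4", "sample5", "sample6"],
}

def fake_expand(pattern, **wildcards):
    # Hypothetical stand-in for expand; it is never reached in the call below.
    return []

error_message = None
try:
    fake_expand(
        "{patient_id}/{sample_id}.bam",
        patient_id=patient_samples.keys(),
        # Python evaluates this argument first and finds no such variable.
        sample_id=patient_samples[patient_id],
    )
except NameError as err:
    error_message = str(err)

print(error_message)
```

Any function called this way would fail identically, since the arguments are evaluated as ordinary Python expressions.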

Using expand is convenient when you already have independent lists of wildcard values. In more complex cases, such as this one where the valid samples depend on the patient, it's simpler to build the list in plain Python:

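# patient_samples maps each patient to a list of samples,
# so a second loop is needed to iterate over that list.
list_inputs_all = [
   f"{patient_id}/{sample_id}.bam"
   for patient_id, sample_ids in patient_samples.items()
   for sample_id in sample_ids
]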
   
rule all:
    input:
        list_inputs_all
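
A quick way to check such a target list is to build it outside Snakemake and print it. The sketch below assumes the `patient_samples` dictionary from the question; `bam_targets` is a name introduced here only for illustration.

```python
patient_samples = {
    "patientA": ["sample1", "sample2", "sample3"],
    "patientB": ["sample1", "sample4", "sample5", "sample6"],
}

# One target per (patient, sample) pair that actually exists in the dictionary.
# The outer loop walks the patients, the inner loop walks that patient's samples.
bam_targets = [
    f"{patient_id}/{sample_id}.bam"
    for patient_id, sample_ids in patient_samples.items()
    for sample_id in sample_ids
]

for target in bam_targets:
    print(target)
```

This should list seven paths, three under patientA and four under patientB. sample1 appears once per patient, and no path is generated for a combination missing from the dictionary, such as patientA/sample4.bam.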
SultanOrazbayev