
I have a JSON file like so:

{
    "foo": {
        "bar1": 
            {"A1": {"name": "A1", "path": "/path/to/A1"}, 
             "B1": {"name": "B1", "path": "/path/to/B1"},
             "C1": {"name": "C1", "path": "/path/to/C1"},
             "D1": {"name": "D1", "path": "/path/to/D1"}},
        "bar2": 
            {"A2": {"name": "A2", "path": "/path/to/A2"}, 
             "B2": {"name": "B2", "path": "/path/to/B2"},
             "C2": {"name": "C2", "path": "/path/to/C2"},
             "D2": {"name": "D2", "path": "/path/to/D2"}}}
}

I am trying to run my snakemake pipeline on the samples in sample sets 'bar1' and 'bar2' separately, putting the results into their own folders. When I expand my wildcards I don't want all iterations of sample sets and samples, I just want them in their specific groups, like this:

tmp/bar1/A1.bam
tmp/bar1/B1.bam
tmp/bar1/C1.bam
tmp/bar1/D1.bam
tmp/bar2/A2.bam
tmp/bar2/B2.bam
tmp/bar2/C2.bam
tmp/bar2/D2.bam

Hopefully my snakefile will help explain. I have tried having my snakefile like this:

sample_sets = [ i for i in config['foo'] ]

samples_dict = config['foo'] #cleans it up

def get_samples(wildcards):
    return list(samples_dict[wildcards.sample_set].keys())

rule all:
    input:
        expand(expand("tmp/{{sample_set}}/{sample}.bam", sample = get_samples), sample_set = sample_sets),

This doesn't work; my file names end up with "<function get_samples at 0x7f6e00544320>" in them! I have also tried:

rule all:
    input:
expand(expand("tmp/{{sample_set}}/{sample}.bam", sample = list(samples_dict["{{sample_set}}"].keys())), sample_set = sample_sets),

but that gets a KeyError. I have also tried this:

rule all:
    input:
        [ ["tmp/{{sample_set}}/{sample}.aligned_bam.core.bam".format( sample = sample ) for sample in list(samples_dict[sample_set].keys())] for sample_set in sample_sets ]

which gets a "Wildcards in input files cannot be determined from output files: 'sample_set'" error.

I feel like there must be a simple way of doing this and perhaps I'm being a moron.

Any help would be very much appreciated! And let me know if I've missed some detail.

ajforster
  • I believe the snakemake way to do this is to use an input function; you can then precisely control what is used as input, based on an output wildcard that selects which group of files the input function's expand returns. https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html?highlight=touch#input-functions – hermidalc Sep 06 '22 at 08:27
  • Or, sorry, another snakemake way to do it is a partial expand with `allow_missing=True`; you would still need an aggregate rule to process each group, so it cannot be in `rule all`. – hermidalc Sep 06 '22 at 10:12
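The input-function approach suggested in the comments might look roughly like this (a sketch, untested; the `merge_sample_set` rule name and the merged-output path are made up for illustration):

```python
# Sketch of the input-function approach (illustrative names, untested).
# Snakemake calls the function with the wildcards of the requested output,
# so it can return exactly the bams belonging to that sample set.
def sample_set_bams(wildcards):
    return expand(
        "tmp/{sample_set}/{sample}.bam",
        sample_set=wildcards.sample_set,
        sample=samples_dict[wildcards.sample_set].keys(),
    )

rule merge_sample_set:
    input:
        sample_set_bams
    output:
        "tmp/{sample_set}/merged.bam"
```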

2 Answers


It is possible to pass a custom combinatoric function to expand; most often that function is zip. In your case, however, the nested dictionary shape would require designing a custom function, so a simpler solution is to use plain Python to construct the list of desired files.

d = {
    "foo": {
        "bar1": {
            "A1": {"name": "A1", "path": "/path/to/A1"},
            "B1": {"name": "B1", "path": "/path/to/B1"},
            "C1": {"name": "C1", "path": "/path/to/C1"},
            "D1": {"name": "D1", "path": "/path/to/D1"},
        },
        "bar2": {
            "A2": {"name": "A2", "path": "/path/to/A2"},
            "B2": {"name": "B2", "path": "/path/to/B2"},
            "C2": {"name": "C2", "path": "/path/to/C2"},
            "D2": {"name": "D2", "path": "/path/to/D2"},
        },
    }
}

list_files = []

for key in d["foo"]:
    for nested_key in d["foo"][key]:
        _tmp = f"tmp/{key}/{nested_key}.bam"
        list_files.append(_tmp)

print(*list_files, sep="\n")
#tmp/bar1/A1.bam
#tmp/bar1/B1.bam
#tmp/bar1/C1.bam
#tmp/bar1/D1.bam
#tmp/bar2/A2.bam
#tmp/bar2/B2.bam
#tmp/bar2/C2.bam
#tmp/bar2/D2.bam
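As an aside about the question's first attempt: `expand` does not call input functions (only a rule's `input:` directive does), so passing `get_samples` itself appears to just format the function object's repr into the pattern, which is where the `<function get_samples at 0x...>` file names came from. Plain `str.format` shows the same behavior:

```python
def get_samples(wildcards):
    return ["A1", "B1"]

# Passing the function object (not its return value) interpolates its repr.
path = "tmp/bar1/{sample}.bam".format(sample=get_samples)
print(path)  # tmp/bar1/<function get_samples at 0x...>.bam
```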
SultanOrazbayev
  • Thank you so much. I thought there might be a snakemake way but this works. I have a few different files in rule all so I wrote a function based on your code to make a list which takes the file extension as an arg. For rules which require all samples (from one sample set) as an input I use: `lambda wildcards: expand("tmp/{{sample_set}}/{sample}_other_file_name.tsv", sample = list(d[wildcards.sample_set].keys()))`. Thanks again – ajforster Feb 22 '22 at 14:27

@SultanOrazbayev has the right of it, but just to throw in a couple of alternatives.

If you like the loops, the pythonic way to write them is with a list comprehension. If you have giant file lists you may notice an improvement in performance.

list_files = [
    f"tmp/{key}/{nested_key}.bam"
    for key in d["foo"]
    for nested_key in d["foo"][key]
]

The only way I can think of to use expand is by basically constructing the same list first. I pass the pairs in as dicts to keep the wildcard names, though tuples would be more efficient. The advantage of expand is if you have your file names in a config variable and can't easily format them, want to keep meaningful wildcard names, or want to use allow_missing for other wildcards:

wcs = [
    {'sample_set': sample_set, 'sample': sample}
    for sample_set in d["foo"]
    for sample in d["foo"][sample_set]
]

list_files = expand(
    "tmp/{sample_set}/{sample}.bam", zip,
    sample_set=[wc['sample_set'] for wc in wcs],
    sample=[wc['sample'] for wc in wcs],
)
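To see what the zip combinator does without running snakemake, here is a plain-Python sketch of the same elementwise pairing (with hard-coded lists for illustration):

```python
pattern = "tmp/{sample_set}/{sample}.bam"
sets = ["bar1", "bar1", "bar2", "bar2"]
samples = ["A1", "B1", "A2", "B2"]

# zip pairs the two lists elementwise instead of taking their product
files = [pattern.format(sample_set=ss, sample=s)
         for ss, s in zip(sets, samples)]
print(files)
# ['tmp/bar1/A1.bam', 'tmp/bar1/B1.bam', 'tmp/bar2/A2.bam', 'tmp/bar2/B2.bam']
```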

Sometimes the snakemake way isn't pythonic!

Troy Comi