I am trying to clean up a data pipeline using Snakemake. Wildcards look like what I need, but I can't get them to work in params.

My function needs a parameter that depends on the wildcard value. For instance, say it depends on sample, which can be either A or B.

I tried the following (my real example is more complicated, but this is basically what I am trying to do):

sample = ["A","B"]

import pandas as pd

def dummy_example(sample):
    return pd.DataFrame({"values": [0,1], "sample": sample})

rule all:
    input:
        "mybucket/sample_{sample}.csv"

rule testing_wildcards:
    output:
        newfile="mybucket/sample_{sample}.csv"
    params:
        additional="{sample}"
    run:
        df = dummy_example(params.additional)
        df.to_csv(output.newfile, index = False)

which gives me the following error:

Wildcards in input files cannot be determined from output files: 'sample'

I then followed the docs and put expand in the output section. For the params, this section of the docs and this thread seemed to give me everything I needed:

sample_list = ["A","B"]

import pandas as pd
import re

def dummy_example(sample):
    return pd.DataFrame({"values": [0,1], "sample": sample})
    
def get_wildcard_from_output(output):
    return re.search(r'sample_(.*?).csv', output).group(1)

rule all:
    input:
        expand("sample_{sample}.csv", sample = sample_list)

rule testing_wildcards:
    output:
        newfile=expand("sample_{sample}.csv", sample = sample_list)
    params:
        additional=lambda wildcards, output: get_wildcard_from_output(output)
    run:
        print(params.additional)
        df = dummy_example(params.additional)
        df.to_csv(output.newfile, index = False)

This fails with:

InputFunctionException in line 16 of /home/jovyan/work/Snakefile:
Error:
  TypeError: expected string or bytes-like object
Wildcards:

Is there a way to capture the value of the wildcard in params so that I can use it in run?

linog

1 Answer

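I think you are trying to use the value of the sample wildcard as a parameter in your rule.

When a param is defined as a function, Snakemake calls it with the wildcards as its first argument (named wc below). That wc variable is an instance of snakemake.io.Wildcards, which is a snakemake.io.Namedlist. You can call .get(key) on these objects, so a lambda function is enough to generate the param.

Keep the {sample} wildcard in output instead of expanding it there, define the param as samples_from_wc=lambda wc: wc.get("sample"), and use it in run/shell as params.samples_from_wc.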

sample_list = ["A","B"]

import pandas as pd

def dummy_data(sample):
    return pd.DataFrame({"values": [0, 1], "sample": sample})

rule all:
    input: expand("sample_{sample}.csv", sample=sample_list)

rule testing_wildcards:
    output:
        newfile="sample_{sample}.csv"
    params:
        samples_from_wc=lambda wc: wc.get("sample")
    run:
        # Build the dummy DataFrame for this sample
        df = dummy_data(params.samples_from_wc)
        # Write output
        df.to_csv(output.newfile, index=False)
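
To see why keeping `{sample}` in `output` matters, here is a rough pure-Python sketch of the idea. It is not Snakemake's actual implementation: `match_wildcards` is a hypothetical helper written only for this illustration, and a plain dict stands in for the `Wildcards` object. The requested file name is matched against the output pattern, and whatever gets captured is what the params function receives.

```python
import re

def match_wildcards(pattern, filename):
    """Hypothetical helper: pull wildcard values out of a concrete file name.

    Simplified on purpose: each wildcard name may appear only once in the pattern.
    """
    # re.split with a capturing group alternates literal text and wildcard names
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, filename)
    return match.groupdict() if match else None

# Same shape as the params function in the rule above
samples_from_wc = lambda wc: wc.get("sample")

wc = match_wildcards("sample_{sample}.csv", "sample_A.csv")
print(samples_from_wc(wc))  # A

# An already-expanded output has no placeholder left, so nothing is captured
print(match_wildcards("sample_A.csv", "sample_A.csv"))  # {}
```

With the expanded output from the question there is no placeholder left to capture, which is consistent with the empty `Wildcards:` in the error message.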
Alex
  • Thanks a lot for your answer, it worked! To check that I have understood: `expand` should go on the `all` rule, which simplifies referring to `sample` values in output (`{sample}`) or in other parts of the script (via the `wc` instance). Is that right? – linog Mar 03 '22 at 09:03
  • Expand is typically on `rule all` unless, for example, you are aggregating everything in a final step, in which case it might be on the second-to-last rule. And yes, I think you understand: Snakemake fills in the `sample` value for inputs/outputs, but if you need to use the value directly you have to get it from the wildcards. – Alex Mar 03 '22 at 10:41
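
As a rough illustration of the point about `expand`, here is a simplified pure-Python stand-in. The name `expand_sketch` is hypothetical; the real function is Snakemake's `expand`, which handles more cases. This sketch assumes every keyword value is a list. It shows that the result is a list of concrete file names with no wildcard left, which is why `expand` fits a rule that requests or aggregates files rather than the output of a per-sample rule.

```python
from itertools import product

def expand_sketch(pattern, **values):
    """Hypothetical stand-in for expand: fill the pattern with every combination."""
    names = list(values)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(values[name] for name in names))
    ]

targets = expand_sketch("sample_{sample}.csv", sample=["A", "B"])
print(targets)  # ['sample_A.csv', 'sample_B.csv']
```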