Use a shell script to obtain a params for every wildcard

Question

I have an R script in my workflow that requires the number of entries on several csv files (a.csv, b.csv, c.csv; formatted with headers) as a value. Since they all have a string something in every line, I thought I could write the rule as follows:

configfile: config.yaml
WILDCARD = config['wildcard']
TEMP_DIR = "~/temp"

rule all:
    input:
        f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"

rule combine_geno_pheno_data_sibs:
    input:
        f"{TEMP_DIR}/file.txt",
        f"{TEMP_DIR}/folder/{{WILDCARD}}/another_file.txt",
        f"config['file']",
    output:
        f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"
    params:
        n_lines = shell(
            "grep -c something ../resources/{{WILDCARD}}.csv | xargs"
        )
    script:
        "scripts/use_lines.R"

config.yaml contains

wildcard:
   - a
   - b
   - c

and n_lines is called in R as snakemake@params$n_lines. The way the expansion is interpreted in shell(), though, is as grep -c something ../resources/a b c.csv, how do I get it to interpret the wildcards as e.g. grep -c something ../resources/a.csv and return the value to n_lines correctly?

Thanks in advance

score 1 · Answer 1 · answered Aug 07 '22 at 08:27

And I didn't really follow what your pipeline does but perhaps this is what you want?

configfile: config.yaml
WILDCARD = ['a', 'b', 'c']
TEMP_DIR = "~/temp"

rule all:
    input:
        expand("{TEMP_DIR}/folder/{WILDCARD}/output.txt", TEMP_DIR=TEMP_DIR, WILDCARD=WILDCARD),

rule combine_geno_pheno_data_sibs:
    input:
        somefile= "{TEMP_DIR}/file.txt",
        anotherfile= "{TEMP_DIR}/folder/{WILDCARD}/another_file.txt",
        csv= lambda wc:  '../resources/%s.csv' % wc.WILDCARD,
    output:
        "{TEMP_DIR}/folder/{WILDCARD}/output.txt"
    script:
        "scripts/use_lines.R"

I would get the n_lines counts inside the R script with standard R code. E.g.

csv <- read.table(snakemake@input$csv, ...)
n_lines <- nrow(csv[csv$somecol == 'something', ])

I don't have much sympathy for script directive because it can see everything in the snakemake workflow so it is not clear what it takes in input. I prefer to write a standalone R script that you execute via shell directive. But this is unrelated to your question...

(Note that in your rule all you miss the input directive but I guess this is a typo).

SultanOrazbayev · Answer 2 · 2022-08-11T02:46:15.653

0

Given the config.yaml, the WILDCARD contains a list of string values:

WILDCARD = config['wildcard']
# WILDCARD = ['a', 'b', 'c']

By applying f-string formatting to rule all input, you are really requesting the following file:

rule all:
   input: f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"
#        "~/temp/folder/{WILDCARD}/output.txt"

It's not clear if you are then requesting specific files from command line or if the code in the question is slightly inconsistent, but to fix your rule, consider defining a utility function that counts number of lines containing "something":

def _count_lines(f):
    path = f"../resources/{f}.csv"
    num_lines = sum(1 for line in open(path) if "something" in line)
    return num_lines

Then the rule combine_geno_pheno_data_sibs will look like:

rule combine_geno_pheno_data_sibs:
    input:
        f"{TEMP_DIR}/file.txt",
        f"{TEMP_DIR}/folder/{{WILDCARD}}/another_file.txt",
        f"config['file']",
    output:
        f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"
    params:
        n_lines = lambda wildcards: _count_lines(wildcards.WILDCARD)
    script:
        "scripts/use_lines.R"

edited Aug 11 '22 at 02:46

answered Aug 06 '22 at 10:24

SultanOrazbayev

14,900
3
16
46

The first two bits of the questions are as intended, as I need different outputs for different values of both `WILDCARD` and, consequently, `n_lines`. What is the inconsistency that you noticed? – Giulio Centorame Aug 06 '22 at 10:54
Also I don't think your function would count lines in the `.csv` file as is. Maybe calling it as `lambda wildcards : _count_lines(f"../resources/{wildcards.WILDCARD}.csv")`? – Giulio Centorame Aug 06 '22 at 11:04
I am still having issues with this answer, as it doesn't recognise `wildcards` – Giulio Centorame Aug 11 '22 at 00:09
Hmmm, what's the error you are receiving? – SultanOrazbayev Aug 11 '22 at 02:48
`Snakemake InputFunctionException. AttributeError: 'wildcards' object has no attribute 'WILDCARD'` – Giulio Centorame Aug 11 '22 at 05:56
This error suggests that there is an inconsistency between the code shared in the question and the code you are executing (specifically in naming of wildcards). – SultanOrazbayev Aug 11 '22 at 05:57

Use a shell script to obtain a params for every wildcard

2 Answers2