Is it possible to use wildcard_constraints to exclude certain keywords from being matched Snakemake?

Question

I have a rule that calculates new variables based on a set of variables, these variables are separated into different files. I have another rule for calculating the averages of all the different variables in the whole database. My problem is that snakemake tries to find my derived variables in the original database, which of course are not there.

Is there a way to have constrained the averaging rule such that it will calculate the average for all variables except for a list of the variables that are derived

Psudo code of how the rule look like

rule calc_average:
    input:
        pi_clim_var = lambda w: get_control_path(w, w.variable),
    output:
        outpath = outdir+'{experiment}/{variable}/{variable}_{experiment}_{model}_{freq}.nc'
    
    log:
        "logs/calc_average/{variable}_{model}_{experiment}_{freq}.log"
    wildcard_constraints:
        variable= '!calculated1!calculated2' # "orrvar1|orrvar2".... 

    notebook:
        "../notebooks/calc_clim.py.ipynb"

I can of make a list of all the variables that I would like to have in the database and then do:

wildcard_constraints:
    variable="|".join(list_of_vars)

But I was wondering if it is possible to do it the other way round? E.g:

wildcard_constraints:
    variable="!".join(negate_list_of_vars) # don't match these wildcards

EDIT: The get_control_path(w, w.variable) constructs the path to the input file based on a lookup table that uses the wildcards as a keys.

def get_control_path(w, variable, grid_label=None):
    if grid_label == None:
        grid_label = config['default_grid_label']
    try:
        paths = get_paths(w, variable,'piClim-control', grid_label,activity='RFMIP', control=True)
    except KeyError:
        paths = get_paths(w, variable,'piClim-control', grid_label,activity='AerChemMIP', control=True)
    return paths

def get_paths(w, variable,experiment, grid_label=None, activity=None, control=False):
    """
    Get CMIP6 model paths in database based on the lookup tables.

    Parameters:
    -----------
        w : snake.wildcards
                a named tuple that contains the snakemake wildcards

    """
    if w.model in ["NorESM2-LM", "NorESM2-MM"]:
        root_path = f'{ROOT_PATH_NORESM}/{CMIP_VER}'
        look_fnames = LOOK_FNAMES_NORESM
    else:
        root_path = f'{ROOT_PATH}/{CMIP_VER}'
        look_fnames = LOOK_FNAMES
    if activity:
        activity= activity
    else:
        activity = LOOK_EXP[experiment]
    model = w.model
    if control:
        variant=config['model_specific_variant']['control'].get(model, config['variant_default'])
    else:
        variant = config['model_specific_variant']['experiment'].get(model, config['variant_default'])
    table_id = TABLE_IDS.get(variable,DEFAULT_TABLE_ID)
    institution = LOOK_INSTITU[model]
    try:
        file_endings = look_fnames[activity][model][experiment][variant][table_id]['fn']
    except:
        raise KeyError(f"File ending is not defined for this combination of {activity}, {model}, {experiment}, {variant} and {table_id} " +
                        "please update config/lookup_file_endings.yaml accordingly")
    if grid_label == None:
        grid_label = look_fnames[activity][model][experiment][variant][table_id]['gl'][0]
    check_path = f'{root_path}/{activity}/{institution}/{model}/{experiment}/{variant}/{table_id}/{variable}/{grid_label}'
    if os.path.exists(check_path)==False:
        grid_labels = ['gr','gn', 'gl','grz', 'gr1']
        i = 0
        while os.path.exists(check_path)==False and i < len(grid_labels):
            grid_label = grid_labels[i]
            check_path = f'{root_path}/{activity}/{institution}/{model}/{experiment}/{variant}/{table_id}/{variable}/{grid_label}'
            i += 1 
    if control:
        version = config['version']['version_control'].get(w.model, 'latest')
    else:
        version = config['version']['version_exp'].get(w.model, 'latest')
    fname = f'{variable}_{table_id}_{model}_{experiment}_{variant}_{grid_label}'
    paths = expand(
            f'{root_path}/{activity}/{institution}/{model}/{experiment}/{variant}/{table_id}/{variable}/{grid_label}/{version}/{fname}_{{file_endings}}'
            ,file_endings=file_endings)
    # Sometimes the verisons are just messed up... try one more time with latest
    if not os.path.exists(paths[0]):
        paths=expand(
        f'{root_path}/{activity}/{institution}/{model}/{experiment}/{variant}/{table_id}/{variable}/{grid_label}/latest/{fname}_{{file_endings}}'
        ,file_endings=file_endings)
        # Sometimes the file ending are different depending on varialbe
        if not os.path.exists(paths[0]) and len(paths) >= 2:
            paths = [paths[1]]

    return paths

You could try to look into negative lookahead and lookbehind regex (e.g. [here](https://superuser.com/questions/477463/is-it-possible-to-use-not-in-a-regular-expression-in-textmate) or [here](https://stackoverflow.com/questions/9952169/negative-look-ahead-python-regex), google for more). However, there maybe better solutions. I would post an example of your files and directory setup to allow people understand better (at least, I'm a bit confused...) — dariober, Apr 28 '22 at 17:39

score 0 · Answer 1 · answered Apr 28 '22 at 18:22

Your description is a little too abstract for me to fully grasp your intent. You may be able to use regex's to solve this, but depending on the number of variables to consider, it could be a very slow calculation to match. Here are some other ideas, if they don't seem right please update your question with a little more context (the rule requesting calc_average and the get_control_path function).

Place derived and original files in different subdirectories. Then you can restrict the average rule to just be the original files.
Incorporate the logic into an input function/expand. Say the requesting rule is doing something like

rule average:
    input: expand('/path/to/{input}', input=[input for input in inputs if input not in negate_list_of_vars])
    output: 'path/to/average'

Thanks for the suggestion, I noticed that I'm using the same output file format for the calculated and averaged variables. I guess the simplest solution would just be to change the output format for one of the rules. — ovewh, Apr 29 '22 at 07:42

Is it possible to use wildcard_constraints to exclude certain keywords from being matched Snakemake?

1 Answers1