I have a Snakemake pipeline in which I need a small data-processing step (applying a rolling average to a dataframe).

I would like to write something like this:

rule average_df:
    input:
        # script = ,
        df_raw = "{sample}_raw.csv"
    params:
        window = 83
    output:
        df_avg = "{sample}_avg.csv"
    shell:
        """
        python
        import pandas as pd
        df=pd.read_csv("{input.df_raw}")
        df=df.rolling(window={params.window}, center=True, min_periods=1).mean()
        df.to_csv("{output.df_avg}")
        """

However, it does not work.

Do I have to create a python file with those 4 lines of code? The alternative that occurs to me is a bit cumbersome. It would be

average_df.py

import pandas as pd


def average_df(i_path, o_path, window):
    # Read the raw table, apply a centered rolling mean, and write the result.
    df = pd.read_csv(i_path)
    df = df.rolling(window=window, center=True, min_periods=1).mean()
    df.to_csv(o_path)


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description='Apply a rolling average to a csv file')
    parser.add_argument('-i', '--input_path', help='input csv file', required=True)
    parser.add_argument('-o', '--output_path', help='output csv file', required=True)
    # type=int: argparse returns strings by default, but rolling() needs an integer window
    parser.add_argument('-w', '--window', type=int, help='window for averaging', required=True)


    args = vars(parser.parse_args())

    i_path = args['input_path']
    o_path = args['output_path']
    window = args['window']

    average_df(i_path, o_path, window)


And then have the snakemake rule like this:

rule average_df:
    input:
        script = "average_df.py",
        df_raw = "{sample}_raw.csv"
    params:
        window = 83
    output:
        df_avg = "{sample}_avg.csv"
    shell:
        """
        python average_df.py --input_path {input.df_raw} --output_path {output.df_avg} --window {params.window}
        """

Is there a smarter or more efficient way to do this? That would be great! Looking forward to your input!

SultanOrazbayev
Ulises Rey
  • 1
    Use `run:` instead of `shell:`; see: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-and-rules – Alex Feb 14 '23 at 10:20

2 Answers

2

This can be achieved via the `run` directive:

rule average_df:
    input:
        # script = ,
        df_raw = "{sample}_raw.csv"
    params:
        window = 83
    output:
        df_avg = "{sample}_avg.csv"
    run:
        import pandas as pd
        df=pd.read_csv(input.df_raw)
        df=df.rolling(window=params.window, center=True, min_periods=1).mean()
        df.to_csv(output.df_avg)

Note that inside `run`, the Snakemake objects `input`, `output`, `params`, etc. are available directly as Python variables.
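To get a feel for what the rolling call in the rule computes, here is a minimal plain-Python sketch. It is not pandas code: `centered_rolling_mean` is a hypothetical helper written for illustration, and it only handles odd windows (such as 83), where a centered window with `min_periods=1` simply shrinks at the edges of the data.

```python
from statistics import mean


def centered_rolling_mean(values, window):
    """Centered rolling mean whose window shrinks at the edges.

    Hypothetical helper for illustration only; it is meant to mirror
    rolling(window, center=True, min_periods=1).mean() on a single
    column, and only for odd window sizes.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("this sketch only handles positive odd windows")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        # Clip the window to the available data at both ends.
        start = max(0, i - half)
        stop = min(len(values), i + half + 1)
        smoothed.append(mean(values[start:stop]))
    return smoothed


raw = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = centered_rolling_mean(raw, window=3)
print(smoothed)  # [1.5, 2.0, 3.0, 4.0, 4.5]
```

The first and last values are averaged over two points only, which is why no rows are lost at the edges.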

SultanOrazbayev
  • 1
    Great, thanks, that is what I was looking for! What do you mean by being directly available? Or what is the difference between input.df_raw and {input.df_raw}? – Ulises Rey Feb 14 '23 at 10:21
  • 1
    The curly bracket syntax is used when substituting values in `shell` directive, but in `run` directive you can directly call the relevant object/wildcard. So in `run` you would call `input.df_raw` rather than `"{input.df_raw}"` (which would be used in `shell`). – SultanOrazbayev Feb 14 '23 at 10:23
  • 1
    I understand. So in your answer the params.window would also be without {}, correct? Thanks! – Ulises Rey Feb 14 '23 at 10:24
  • 1
    You are correct, I fixed the code. – SultanOrazbayev Feb 14 '23 at 10:26
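The difference discussed in these comments can be illustrated outside Snakemake with a rough analogy. In the sketch below, `rule_input` and `rule_params` are hypothetical stand-ins for the objects Snakemake provides, and Python's `str.format` is used only to mimic the curly-bracket substitution; Snakemake's own formatting may differ in its details.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the objects Snakemake hands to a rule.
rule_input = SimpleNamespace(df_raw="sampleA_raw.csv")
rule_params = SimpleNamespace(window=83)

# shell-style: values are substituted into a command string,
# so everything ends up as text.
command = (
    "python average_df.py --input_path {input.df_raw} --window {params.window}"
).format(input=rule_input, params=rule_params)

# run-style: the objects are used directly as Python values,
# so the window stays an integer.
path = rule_input.df_raw
window = rule_params.window

print(command)
print(path, type(window).__name__)
```

This is also why the window needs no conversion in `run`, while a script called from `shell` receives it as a string.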
1

The `run` directive seems the way to go. It may also be good to know that you can do the same with the `-c` option of `python`, which runs a program passed as a string. E.g.:

shell:
        r"""
python -c '
import pandas as pd
df=pd.read_csv("{input.df_raw}")
etc etc...
'
        """ 
dariober
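As a small sketch of how `python -c` behaves, independent of Snakemake: the snippet below starts a fresh interpreter and passes the program as a single string. `run_inline` is a hypothetical helper name, and the inline program is a toy example, not the pandas code from the question.

```python
import subprocess
import sys


def run_inline(code):
    """Run `code` in a fresh interpreter via `python -c` and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,  # raise if the inline program fails
    )
    return result.stdout


# The whole program travels as one string argument, newlines included.
inline_code = "values = [1, 2, 3]\nprint(sum(values) / len(values))\n"
output = run_inline(inline_code)
print(output.strip())  # 2.0
```

In a `shell` block the string is wrapped in single quotes, so the Python code inside should use double quotes for its own strings, as in the answer above.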