I have a Snakemake recipe which contains a very expensive preparatory step, common to all its calls. Here is a pseudo-rule for demonstration's sake:

rule sample:
    input:
        "{name}.config"
    output:
        "{name}.npz"
    run:
        import numpy as np
        import somemodule

        data = somemodule.Loader("some_big_data")  # expensive
        np.savez(output[0], data.process(input))  # also expensive

At the moment the data is loaded de novo for every target, which is far from optimal. How can I make it load only once?

I am looking for something that would allow me to rewrite the rule like this:

rule sample:
    input:
        "{name}.config"
    output:
        "{name}.npz"
    setup:
        import somemodule
        
        data = somemodule.Loader("some_big_data")  # expensive
    run:
        np.savez(output, data.process(input))  # also expensive

or:

rule sample:
    input:
        "{name}.config"
    output:
        "{name}.npz"
    run:
        import somemodule

        data = somemodule.Loader("some_big_data")  # expensive
        
        for job in jobs:
            np.savez(job.output,
                     data.process(job.input))  # also expensive

In another question I have described the code that Loader.__init__() is based on.

abukaj
  • You could load it outside the rule and then pass it as a `param`? – Michael Hall Aug 04 '21 at 23:37
  • @MichaelHall I am not sure if I got what you mean, but wouldn't that make it load the data every time I call `snakemake` (even if there is nothing to do)? – abukaj Aug 05 '21 at 11:13
  • Yes, that is true. The only other way I can think of speeding this up is by improving the speed of the `somemodule.Loader` function. Does it just load a file into memory? Or does it load a file and do some processing on the file contents before returning `data`? – Michael Hall Aug 05 '21 at 23:07
  • @MichaelHall it loads a [_FEniCS_](https://fenicsproject.org/) mesh and related objects (`FunctionSpace`, `MeshFunctionSizet`), which are used by the `.process()` method to solve an equation parametrized by the input file. I think there is a lot more processing than just reading the binary data. – abukaj Aug 07 '21 at 15:57
  • @MichaelHall If you are interested in the most important code from the constructor, please see my other question: https://stackoverflow.com/questions/68694729/how-can-i-load-fenics-objects-faster – abukaj Aug 07 '21 at 17:39

1 Answer

One possible solution is to create a pickled object with the data of interest. Please research the security considerations of using pickled objects to check that it is acceptable for your case. If it is, then it would be along the following lines:

rule sample:
    input:
        "{name}.config"
    output:
        pickle = "{name}.pickle",
    run:
        import somemodule
        import pickle
        
        data = somemodule.Loader("some_big_data")  # expensive
        # pickle.dump takes the object first, then a file opened in binary mode
        with open(output.pickle, "wb") as f:
            pickle.dump(data, f)

In downstream rules you would reference the pickled file like any other file, just making sure to load it with pickle.load.
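As a rough illustration of the dump and load steps, here is a minimal Python sketch of the round trip. It uses a plain dict as a hypothetical stand-in for the object returned by `somemodule.Loader`, and the helper names `dump_prepared`, `load_prepared` and `process` are made up for this example. Whether the real `Loader` object can be pickled at all is an assumption that would need to be checked.

```python
import pickle
import tempfile
from pathlib import Path


def dump_prepared(prepared, path):
    # Pickle files are binary, so the file must be opened in "wb" mode.
    with open(path, "wb") as handle:
        pickle.dump(prepared, handle)


def load_prepared(path):
    # Only unpickle files you created yourself: pickle.load can execute
    # arbitrary code embedded in a malicious file.
    with open(path, "rb") as handle:
        return pickle.load(handle)


def process(prepared, value):
    # Hypothetical stand-in for data.process(): uses the prepared state.
    return value * prepared["scale"]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "prepared.pickle"

    # Upstream rule: do the expensive preparation once and store the result.
    # The dict is a stand-in, not the real Loader object.
    dump_prepared({"source": "some_big_data", "scale": 2}, cache)

    # Downstream rule: reload the stored object instead of rebuilding it.
    restored = load_prepared(cache)
    result = process(restored, 21)
```

In a Snakemake workflow the dump would live in the `run:` block of the rule that produces the pickle, and the load in the `run:` block of each rule that lists the pickle among its inputs.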

SultanOrazbayev