
I have a Snakemake workflow that I've been using to train deep learning TensorFlow models. At a high level it consists of a few long-running jobs (model training) that can run in parallel. I would like to run these on the cloud, and dask-cloudprovider seems like a promising option since it makes it easy to use GPUs on ECS. To do this, though, would I have to rewrite my workflow using Dask primitives (maybe dask.delayed)? Or is there some way to get Snakemake to use Dask?
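
To make this concrete, here's roughly what I imagine a dask.delayed rewrite would look like (just a sketch: train_model and the configs are placeholders, and I'm assuming ECSCluster's worker_gpu option, since Fargate itself doesn't offer GPUs):

import dask
from dask.distributed import Client
from dask_cloudprovider import ECSCluster

@dask.delayed
def train_model(config):
    # placeholder: build and fit one TensorFlow model
    ...

cluster = ECSCluster(n_workers=4, worker_gpu=1)  # GPU workers on ECS
client = Client(cluster)

configs = [{"lr": 1e-3}, {"lr": 1e-4}]  # hypothetical hyperparameter grid
# one delayed task per model; compute() runs them in parallel on the cluster
results = dask.compute(*[train_model(c) for c in configs])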

j sad
2 Answers


If you do a web search for "dask snakemake", you'll find a GitHub issue from 2017 that you might want to read through. The integration is certainly possible, but someone would need to write it.

You may also want to try Dask's integration with Airflow or, perhaps a bit more modern, the Prefect library.
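
For example, with Prefect (assuming the 0.x API current at the time of writing) you can map a training task over configurations and hand execution to a Dask cluster via DaskExecutor; train_model, the configs, and the scheduler address below are placeholders:

from prefect import Flow, task
from prefect.engine.executors import DaskExecutor

@task
def train_model(config):
    # placeholder: fit one TensorFlow model
    return config

configs = [{"lr": 1e-3}, {"lr": 1e-4}]  # hypothetical hyperparameter grid

with Flow("train-models") as flow:
    results = train_model.map(configs)

# point the executor at any running Dask scheduler,
# e.g. one started by dask-cloudprovider
flow.run(executor=DaskExecutor(address="tcp://<scheduler>:8786"))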

MRocklin
  • Thanks, @MRocklin, for pointing me to that issue (https://github.com/dask/dask/issues/2119). I also found another really good discussion here: https://github.com/pangeo-data/pangeo/issues/523. Also, thanks for pointing me to Airflow and Prefect. I think for what I'm trying to do, Snakemake and Dask would be filling the same role. I was thinking I could use Dask to spin off the workers, but maybe that is over-complicating things. – j sad May 11 '20 at 19:31

I've never used Dask before and I don't use the cloud, so I may be completely off here.

I don't see why Snakemake and Dask shouldn't play well with each other. Can't you do something like:

rule one:
    input:
        ...  # whatever rule one needs
    output:
        'out.txt',
    run:
        from dask_cloudprovider import FargateCluster
        # spin up a cluster here and do stuff

rule two:
    input:
        'out.txt',
    output:
        ...
    run:
        # Do stuff with out.txt
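
For instance, rule one's run block could spin up the cluster, farm out the training jobs, and write the output file. Something like this (untested; train_model and the configs are placeholders):

rule one:
    output:
        'out.txt',
    run:
        from dask.distributed import Client
        from dask_cloudprovider import FargateCluster

        def train_model(config):
            ...  # placeholder: fit one TensorFlow model

        configs = [{"lr": 1e-3}, {"lr": 1e-4}]  # hypothetical settings
        # start the cluster, run the trainings in parallel, tear it down
        with FargateCluster(n_workers=2) as cluster, Client(cluster) as client:
            results = client.gather(client.map(train_model, configs))
        with open(output[0], 'w') as fh:
            fh.write('\n'.join(map(str, results)))
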
dariober