Is it possible to have a multi-node Dask cluster be the compute for a PythonScriptStep
with AML Pipelines?
We have a PythonScriptStep that uses featuretools' deep feature synthesis (dfs) (docs). ft.dfs() has a parameter, n_jobs, which allows for parallelization. When we run it on a single machine, the job takes three hours; it runs much faster on a Dask cluster. How can I operationalize this within an Azure ML pipeline?