Dask Distributed - Same persist data multiple clients

Question

We are trying Dask Distributed to make some heavy computes and visualization for a frontend.

Now we have one worker with gunicorn that connects to an existing Distributed Dask cluster, the worker uploads the data currently with read_csv and persist into the cluster.

I've tried using pickle to save the futures from the persist dataframe, but it doesn't work.

We want to have multiple gunicorn workers, each with a different client connecting to the same cluster and using the same data, but with more workers each one uploads a new dataframe.

score 1 · Answer 1 · answered May 11 '19 at 00:18

1

It sounds like you are looking for Dask's abililty to publish datasets

A convenient way to do this is to using the client.datasets mapping

Client 1

client = Client('...')
df = dd.read_csv(...)
client.datasets['my-data'] = df

Client 2..n

client = Client('...')  # same scheduler
df = client.datasets['my-data']

answered May 11 '19 at 00:18

MRocklin

55,641
23
163
235

This is just was I needed Thank you very much! – CValenzu May 15 '19 at 22:37

Dask Distributed - Same persist data multiple clients

1 Answers1