
I have created a single (remote) scheduler and ten workers on different machines on the same network, and I am trying to distribute a dataframe to them from a client. My problem is that the scatter takes 30 minutes.

import pandas as pd
from dask.distributed import Client

df = pd.DataFrame({i: range(10) for i in range(10)})
client = Client(scheduler_addr)  # scheduler_addr: address of the remote scheduler
future = client.scatter(df, broadcast=True)

This code works, but it is too slow to be usable; with broadcast=False it runs reasonably fast. I created the scheduler and the workers with default arguments. How should this be done instead?

My dask.distributed version is 2022.01.0.

SultanOrazbayev
Philipp -

1 Answer


Scatter with broadcasting should be very fast for small objects, but will be slower on large objects.
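To check this on a given cluster, it can help to time the two modes side by side. The following is only a minimal sketch: `time_scatter` and `main` are hypothetical helper names, and `main` assumes a throwaway in-process `LocalCluster` rather than the remote scheduler from the question.

```python
import time


def time_scatter(client, obj, broadcast):
    """Run one client.scatter call and return (future, elapsed_seconds)."""
    start = time.perf_counter()
    future = client.scatter(obj, broadcast=broadcast)
    elapsed = time.perf_counter() - start
    return future, elapsed


def main():
    # Imported here so that time_scatter itself has no third-party dependencies.
    import pandas as pd
    from dask.distributed import Client, LocalCluster

    # Assumption: a small in-process cluster for illustration only.
    # For a real deployment, use Client(scheduler_addr) instead.
    with LocalCluster(n_workers=2, processes=False) as cluster, Client(cluster) as client:
        df = pd.DataFrame({i: range(10) for i in range(10)})
        for broadcast in (False, True):
            _, seconds = time_scatter(client, df, broadcast)
            print(f"broadcast={broadcast}: {seconds:.3f}s")
```

Calling `main()` prints one timing per mode. If the two numbers differ by orders of magnitude for a 10 by 10 dataframe, the size of the data is unlikely to be the cause.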

One way to avoid sending large objects across the network is to store them at a common location and instruct workers to load these objects directly:

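import pandas as pd
from dask.distributed import Client

# string column names, since to_parquet can reject non-string names
df = pd.DataFrame({str(i): range(10) for i in range(10)})
df.to_parquet('my_file.parquet')  # must be a location every worker can read

def run_batch(n):
    # each task loads the data itself instead of receiving it over the network
    df = pd.read_parquet('my_file.parquet')
    ...

client = Client()
futures = client.map(run_batch, range(10))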

There is also a hack for this use-case, once_per_worker, see this blog post.

SultanOrazbayev
  • Thanks, delivering the data from a server is an alternative but it seems like this should work out of the box. At first I thought it was the size of the data, but it is incredibly slow in my configuration to "broadcast" even such a small object. I am wondering if I need to tweak some configuration? – Philipp - May 09 '22 at 14:27
  • Ah, so even a 10 by 10 dataframe is slow to send? That should have been fast... – SultanOrazbayev May 09 '22 at 14:58
  • Exactly: client.scatter(..., broadcast=False) runs in milliseconds, while broadcast=True takes half an hour (all workers are on the same network). How could I go about debugging this? – Philipp - May 09 '22 at 16:35
  • Not sure, but one relatively quick thing to try is to upgrade dask to the latest version... – SultanOrazbayev May 09 '22 at 16:39
  • The solution was to upgrade to 2022.05. I am not sure why this feature was broken before. – Philipp - May 24 '22 at 22:19
  • In case anyone else comes across this: one major implication of `broadcast=True` is that it will wait for workers to show up. So if something is wrong with worker spin-up, or the dask versions are incompatible, scatter can hang. It's possible this is partly to blame for the long (or indefinite) hang. – Michael Delgado May 29 '22 at 19:59
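
The two suspects raised in the comments (workers that never register and mismatched versions) can be checked before scattering. This is a rough sketch, not a confirmed diagnosis of the original problem: `find_version_mismatches` and `diagnose` are hypothetical helpers, and the code assumes that `client.get_versions()` returns a dict with `client`, `scheduler` and `workers` entries, each carrying a `packages` mapping.

```python
def find_version_mismatches(versions, packages=("python", "dask", "distributed")):
    """Return {package: {version: [node, ...]}} for packages on which nodes disagree.

    `versions` is assumed to have the layout returned by client.get_versions().
    """
    nodes = {"client": versions["client"], "scheduler": versions["scheduler"]}
    nodes.update(versions.get("workers", {}))

    mismatches = {}
    for package in packages:
        seen = {}
        for name, info in nodes.items():
            version = str(info.get("packages", {}).get(package))
            seen.setdefault(version, []).append(name)
        if len(seen) > 1:  # more than one distinct version in the cluster
            mismatches[package] = seen
    return mismatches


def diagnose(client, expected_workers, timeout=30):
    # Fail with a timeout error instead of hanging if workers never register.
    client.wait_for_workers(n_workers=expected_workers, timeout=timeout)
    # An empty dict means every node reports the same versions.
    return find_version_mismatches(client.get_versions())
```

For the setup in the question this would be called as `diagnose(client, expected_workers=10)`. A non-empty result lists which nodes run which version of each package.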