
I'm experimenting with Dask by running a local cluster with four workers on my laptop.

I distribute a Pandas dataframe among the workers, but when I run a function on it I see from the dashboard that only one of them is actually used.

What am I missing?

Here is the code:

from distributed import Client
client = Client('127.0.0.1:56947')
dd = client.scatter(df, broadcast=True) # df is a pandas DataFrame
r = client.submit(process_df, dd) 
Vincenzo Lavorini

1 Answer


This line

dd = client.scatter(df, broadcast=True)

copied df to each of your workers. However, the scattered dataframe is still a single entity, and you are submitting one task to work on it. A task is the unit of granularity in Dask and will not be split up further, so only one worker runs it.

What you want to do is split your dataframe into partitions. You can do this yourself (e.g., with df.loc[..] slices), but there is also dask.dataframe, built specifically for this kind of manipulation: for example, replacing your existing pandas.read_csv with dask.dataframe.read_csv. A sketch of both approaches is below.
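
For example, a minimal sketch of both approaches (assuming process_df can operate on a chunk of the dataframe; four partitions to match your four workers):

import dask.dataframe as dd   # the dask.dataframe module, not the scattered future above
from distributed import Client

client = Client('127.0.0.1:56947')

# Option 1: split the pandas dataframe yourself and submit one task per chunk,
# so that every worker gets its own piece to process.
n = 4
size = -(-len(df) // n)                      # ceiling division: rows per chunk
chunks = [df.iloc[i * size:(i + 1) * size] for i in range(n)]
futures = client.map(process_df, chunks)     # one task per chunk
results = client.gather(futures)

# Option 2: let dask.dataframe handle the partitioning; map_partitions applies
# process_df to each partition, and the partitions are processed in parallel.
ddf = dd.from_pandas(df, npartitions=4)      # or dd.read_csv(...) instead of pandas.read_csv
result = ddf.map_partitions(process_df).compute()

Either way the work is now expressed as several tasks rather than one, so the dashboard should show all four workers busy.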

mdurant