
In dask distributed I get the following warning, which I would not expect:

/home/miniconda3/lib/python3.6/site-packages/distributed/worker.py:739: UserWarning: Large object of size 1.95 MB detected in task graph: 
  (['int-58e78e1b34eb49a68c65b54815d1b158', 'int-5cd ... 161071d7ae7'],)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  % (format_bytes(len(b)), s))

The reason I'm surprised is that I'm doing exactly what the warning suggests:

import dask.dataframe as dd
import pandas
from dask.distributed import Client, LocalCluster

c = Client(LocalCluster())
dask_df = dd.from_pandas(pandas.DataFrame.from_dict({'A':[1,2,3,4,5]*1000}), npartitions=10)
filter_list = c.scatter(list(range(2,100000,2)))
mask = c.submit(dask_df['A'].isin, filter_list)
dask_df[mask.result()].compute()

So my question is: Am I doing something wrong or is this a bug?

pandas='0.22.0'
dask='0.17.0'
dennis-w

1 Answer


The main reason dask is complaining isn't the list; it's the pandas dataframe inside the dask dataframe.

dask_df = dd.from_pandas(pandas.DataFrame.from_dict({'A':[1,2,3,4,5]*1000}), npartitions=10)

You are creating a biggish amount of data locally when you create a pandas dataframe in your local session. Then you operate with it on the cluster. This will require moving your pandas dataframe to the cluster.

You're welcome to ignore these warnings, but in general I would not be surprised if performance here is worse than with pandas alone.
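
For comparison, the same filter in pandas alone might look like the sketch below. It reuses the column `A` and the value range from the question; the names `df`, `filter_values`, and `filtered` are just placeholders for illustration.

```python
import pandas as pd

# Same toy data as in the question: 5000 rows, values 1..5 repeated
df = pd.DataFrame({"A": [1, 2, 3, 4, 5] * 1000})

# Even numbers from 2 up to (but not including) 100000
filter_values = list(range(2, 100000, 2))

# Boolean mask computed locally, no scheduler or workers involved
filtered = df[df["A"].isin(filter_values)]
print(len(filtered))
```

At this data size everything fits comfortably in memory, so nothing has to be moved to a cluster.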

There are a few other things going on here. Scattering a list produces one future per element rather than a single future for the whole list, which may not be what you want. You're also calling submit on a dask object, which is usually unnecessary.
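
To illustrate the point about scatter: as far as I know, `client.scatter` on a list returns one future per element, and wrapping the list in an outer list is one way to get a single future for the whole thing. A minimal sketch, assuming an in-process cluster (`processes=False`) and a tiny placeholder list `values` instead of the question's data:

```python
from dask.distributed import Client

# In-process cluster, only to keep the sketch small
# (an assumption, not the setup from the question)
client = Client(processes=False)

values = [2, 4, 6]

# Scattering a list sends each element separately: one future per element
per_element = client.scatter(values)
n_futures = len(per_element)

# Wrapping the list in an outer list scatters it as a single object
[whole] = client.scatter([values])
round_trip = whole.result()

print(n_futures, round_trip)
client.close()
```

With the list from the question, the first form would create tens of thousands of tiny futures, which is probably not the intent.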

MRocklin
  • Thanks for your answer. The real dataframe in my production environment is one I get from reading a bunch of parquet files. When I use isin directly without scattering first, the warning also appears. Using isin directly with scattering is not possible because dask dataframe isin doesn't accept futures. So what would be the most elegant way to perform the task? – dennis-w Feb 22 '18 at 15:00
  • Yeah, looks like `isin` is trying out the operation on a small pandas dataframe and failing there. I'm not sure how to improve the situation. Looks like a bug. I would raise a bug report at https://github.com/dask/dask/issues/new – MRocklin Feb 22 '18 at 16:45
  • Thank you very much! I will do that. – dennis-w Feb 23 '18 at 07:42