
I have some code which samples a record from a pandas.DataFrame for each record in a dask.DataFrame, repeated k times.

But it throws a warning:

UserWarning: Large object of size 1.12 MB detected in task graph: 
  (       metric  label group_1 group_2
6251       1 ... 6f875181063ba')
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  % (format_bytes(len(b)), s)

Trying to work around this by manually broadcasting the data with:

 client.scatter(group_0, broadcast=True)

still ends up re-sending group_0 with every task graph. How can I tell dask to use the already-broadcast copy? Do I need to gather the scattered data? Could the code be optimized further?

See the code below:

import numpy as np
import pandas as pd

seed = 47
np.random.seed(seed)

size = 100000
df = pd.DataFrame({'metric': np.random.randint(1, 100, size=size)})
df['label'] = np.random.randint(0, 2, size=size)
df['group_1'] = pd.Series(np.random.randint(1, 12, size=size)).astype(object)
df['group_2'] = pd.Series(np.random.randint(1, 10, size=size)).astype(object)
display(df.head())

group_0 = df[df['label'] == 0]
group_0 = group_0.reset_index(drop=True)
group_0 = group_0.rename(index=str, columns={"metric": "metric_group_0"})

join_columns_enrich = ['group_1', 'group_2']
join_real = ['metric_group_0']
join_real.extend(join_columns_enrich)
group_0 = group_0[join_real]
display(group_0.head())
group_1 = df[df['label'] == 1]
group_1 = group_1.reset_index(drop=True)
display(group_1.head())

import dask.dataframe as dd
from dask.distributed import Client

client = Client()
display(client)
client.cluster


resulting_df = None
k = 3

def knnJoinSingle_series(original_element, group_0, join_columns, random_state):
    # use the join_columns parameter, not the global join_columns_enrich
    limits_dict = original_element[join_columns].to_dict()
    query = ' & '.join([f"{col} == {val}" for col, val in limits_dict.items()])
    candidates = group_0.query(query)
    if len(candidates) > 0:
        return candidates.sample(n=1, random_state=random_state)['metric_group_0'].values[0]
    else:
        return np.nan

for i in range(1, k+1):
    print(i)
    # WARNING:not setting random state, otherwise always the same record is picked
    # in case of same values from group selection variables. Is there a better way?
    group_1_dask = dd.from_pandas(group_1, npartitions=8)
    group_1_dask['metric_group_0'] = group_1_dask.apply(
        lambda x: knnJoinSingle_series(x, group_0, join_columns_enrich, random_state=None),
        # meta is float64 rather than int64: the function can return np.nan
        axis=1, meta=('metric_group_0', 'float64'))
    group_1 = group_1_dask.compute()
    group_1['run'] = i

    if resulting_df is None:
        resulting_df = group_1
    else:
        resulting_df = pd.concat([resulting_df, group_1])

resulting_df['difference'] = resulting_df['metric'] - resulting_df['metric_group_0']
resulting_df['differenceAbs'] = np.abs(resulting_df['difference'])

display(resulting_df.head())
print(len(resulting_df))
print(resulting_df.difference.isnull().sum())
Georg Heiler
  • Possibly duplicate of https://stackoverflow.com/questions/52997229/is-there-an-advantage-to-pre-scattering-data-objects-in-dask – Georg Heiler May 10 '19 at 09:48

1 Answer

You will want to do the following before using the variable in your dask dataframe computation (probably immediately after creating the Client):

group_0_future = client.scatter(group_0, broadcast=True)

i.e., replace instances of your concrete dataframe with this future, a reference to the copies held on the cluster. Dask will then resolve the future to the worker-local copy of the data.
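
A minimal sketch of how this could slot into the question's loop (an illustration, not a guaranteed fix: it assumes the future is handed to apply via args= so that it travels through the task graph as a plain task argument; a future captured inside the lambda's closure is pickled as an opaque object and arrives on the workers as an unresolved Future, which matches the error reported in the comments below):

group_0_future = client.scatter(group_0, broadcast=True)  # broadcast once, up front

group_1_dask = dd.from_pandas(group_1, npartitions=8)
group_1_dask['metric_group_0'] = group_1_dask.apply(
    knnJoinSingle_series,
    # the future is a task argument here, so the scheduler can
    # substitute the worker-local copy before the function runs
    args=(group_0_future, join_columns_enrich, None),
    axis=1, meta=('metric_group_0', 'float64'))
group_1 = group_1_dask.compute()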

mdurant
  • That fails with: AttributeError: ("'Future' object has no attribute 'query'", 'occurred at index 25004'). Do I need to unpack the future? – Georg Heiler May 10 '19 at 17:54
  • Er no, you need to use it only in dask manipulations. Your code is too complex for me to work out where this might be happening. – mdurant May 10 '19 at 18:01
  • I created a broadcasted variable: `group_0_scattered = client.scatter(group_0, broadcast=True)`. And pass it here: `group_1_dask = dd.from_pandas(group_1, npartitions=8) group_1_dask['metric_group_0']= group_1_dask.apply(lambda x: knnJoinSingle_series(x, group_0_scattered, join_columns_enrich, random_state=None), axis = 1, meta=('metric_group_0', 'int64')) group_1 = group_1_dask.compute()`. But this fails even though it is entirely happening inside dask. – Georg Heiler May 10 '19 at 18:07
  • please see https://stackoverflow.com/questions/56084548/dask-broadcast-not-available-during-compute-graph I rephrased the question, as this is not the whole truth. It seems to block over and over again in the apply function, i.e. it works, but is currently slower than a plain pandas implementation. – Georg Heiler May 10 '19 at 20:39
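
On the last part of the question (could the code be optimized further?): the per-row group_0.query call re-parses a query string for every record, which is likely why the dask version ends up slower than plain pandas, as the final comment observes. One possible alternative, sketched below against the question's setup and not taken from the answer above, is to pre-group group_0 once by the join columns and sample from plain NumPy arrays per row; at this data size that may make dask unnecessary:

import numpy as np

# build the lookup once: (group_1, group_2) -> array of metric_group_0 values
lookup = {
    key: sub['metric_group_0'].to_numpy()
    for key, sub in group_0.groupby(['group_1', 'group_2'])
}

rng = np.random.default_rng()  # seed here if reproducible draws are needed

def sample_metric(row):
    # O(1) dict lookup instead of a string-parsed DataFrame.query per row
    values = lookup.get((row['group_1'], row['group_2']))
    if values is None:
        return np.nan
    return rng.choice(values)

group_1['metric_group_0'] = group_1.apply(sample_metric, axis=1)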