
I have set up a scheduler and 4 worker nodes to do some processing on a CSV. The size of the CSV is just 300 MB.

import copy
import dask.dataframe as dd
from dask import delayed

# client is an existing distributed.Client connected to the scheduler
df = dd.read_csv('/Downloads/tmpcrnin5ta', assume_missing=True)

df = df.groupby(['col_1', 'col_2']).agg('mean').reset_index()
df = client.persist(df)



def create_sep_futures(symbol, df):
    symbol_df = copy.deepcopy(df[df['symbol' == symbol]])
    return symbol_df

lazy_values = [delayed(create_sep_futures)(symbol, df) for symbol in st]

future = client.compute(lazy_values)
result = client.gather(future)

The st list contains 1000 elements.

When I do this, I get this error:

 distributed.worker - WARNING -  Compute Failed
 Function:  create_sep_futures
 args:      ('PHG',       symbol  col_3  col_2  \
 0                A            1.451261e+09                23.512857   
 1                A            1.451866e+09                23.886857   
 2                A            1.452470e+09                25.080429   

 kwargs:    {}
 Exception: KeyError(False,)

My assumption was that each worker would get the full dataframe and run the query on it. But I think each worker just gets its block (partition) and tries to do the query on that.

What is the workaround for this? Since the dataframe chunks are already in the workers' memory, I don't want to move the whole dataframe to each worker.

Simon Featherstone

1 Answer


Operations on dask dataframes, using the dataframe syntax and API, are lazy (delayed) by default; you need do nothing more.

First problem: your syntax is wrong: df[df['symbol' == symbol]] should be df[df['symbol'] == symbol]. The expression 'symbol' == symbol is an ordinary string comparison that evaluates to False, and that False is then used as a column key; that is the origin of the KeyError(False,).
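
A minimal pandas illustration of what happens (the tiny dataframe here is made up; 'PHG' is one of the symbols from your error message):

import pandas as pd

pdf = pd.DataFrame({'symbol': ['PHG', 'A'], 'col_2': [1.0, 2.0]})

'symbol' == 'PHG'            # plain string comparison, evaluates to False
pdf['symbol' == 'PHG']       # same as pdf[False] -> KeyError: False
pdf[pdf['symbol'] == 'PHG']  # correct: boolean mask selecting the 'PHG' rows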

So the solution you are probably looking for:

future = client.compute(df[df['symbol'] == symbol])
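
To cover all 1000 symbols, a rough sketch along the same lines (assuming client, df and the st list from your question are already in place):

futures = [client.compute(df[df['symbol'] == symbol]) for symbol in st]
results = client.gather(futures)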

If you do want to work on the chunks separately, you can look into df.map_partitions, which you use with a normal (pandas) function and which takes care of passing the data (or delayed/futures) for you, or df.to_delayed, which gives you a set of delayed objects, one per partition, that you can use with a delayed function.
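
A minimal sketch of both chunk-wise approaches; the per_partition helper is hypothetical, and 'PHG' again just stands in for one of your symbols:

import dask

def per_partition(pdf):
    # pdf is a plain pandas DataFrame (one partition of df)
    return pdf[pdf['symbol'] == 'PHG']

# map_partitions: apply an ordinary pandas function to every partition
subset = df.map_partitions(per_partition).compute()

# to_delayed: one delayed object per partition, to combine with delayed functions
parts = df.to_delayed()
lazy = [dask.delayed(per_partition)(p) for p in parts]
results = dask.compute(*lazy)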

mdurant