I'm trying to convert this pandas code to Dask to speed up a loop that will run over 40 million rows (any other tricks to make it faster are welcome!).

The pandas version works well, but the Dask one raises an error. This is my first time using Dask, so I don't yet have a feel for how to make it work.

Pandas original code:

   for bv in df_2_transform.index.unique():
      # for every row where index == bv and the date is in August, write 100
      df.loc[bv and (pd.to_datetime(df['date']).dt.month == 8), v_n] = 100

My Dask attempt:

   for bv in df_2_transform.index.unique():

      df_receveur[v_n] = df[v_n].mask(bv and (dd.to_datetime(df['date']).dt.month == 8), 100)

where v_n is the name of a column. I got this error message:

ValueError: Metadata inference failed in `mask`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
ValueError('Must specify axis=0 or 1')

Thanks for your help

Jonathan Roy
  • this is a pretty good summary: https://blog.dask.org/2022/08/09/understanding-meta-keyword-argument – Michael Delgado Nov 21 '22 at 23:23
  • Does this answer your question? [How to map a column with dask](https://stackoverflow.com/questions/40019905/how-to-map-a-column-with-dask) – Michael Delgado Nov 21 '22 at 23:24
  • If I understand correctly, I must first specify the type of the data I will write into the DataFrame, is that right? – Jonathan Roy Nov 22 '22 at 13:36
  • yep. dask depends on knowing the shape and types of results ahead of time so it can schedule operations and allocate arrays *before* executing the computation. So if you pass in a custom operation which dask doesn't know about, you need to specify the output types manually. – Michael Delgado Nov 22 '22 at 17:38
  • in your code, you have `bv and (dd.to_datetime(df['date']).dt.month == 8)`. do you mean to use the bitwise operator `&`, e.g. `bv & (dd.to_datetime(df['date']).dt.month == 8)`? or do you really intend to use a single boolean in your mask? – Michael Delgado Nov 22 '22 at 17:38
  • not sure I completely understand what a 'bitwise operator' is, but my understanding is that I intend to use a single boolean – Jonathan Roy Nov 23 '22 at 19:49

0 Answers