
As the title says, I can't run this code:

from statsmodels.tsa.seasonal import seasonal_decompose

def simple_map(x):
    y = seasonal_decompose(x, model='additive', extrapolate_trend='freq',
                           period=7, two_sided=False)
    return y.trend

b.map_partitions(simple_map, meta=b).compute()

where b is a Dask DataFrame with a datetime index and float columns, and seasonal_decompose is the statsmodels function.

This is what I get:

Index(...) must be called with a collection of some kind, 'seasonal' was passed

If I do:

b.apply(simple_map, axis=0)

where b is a pandas DataFrame, I get what I want.

Where am I going wrong?


Reproducible example:

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

d = {'Val1': [3, 2, 7, 5], 'Val2': [2, 4, 8, 6]}
b = pd.DataFrame(data=d)
b = b.set_index(pd.to_datetime(['25/12/1991', '26/12/1991',
                                '27/12/1991', '28/12/1991']))

def simple_map(x):
    y = seasonal_decompose(x, model='additive', extrapolate_trend='freq',
                           period=2, two_sided=False)
    return y.trend

b.apply(simple_map, axis=0)

            Val1    Val2
1991-12-25  0.70    0.9
1991-12-26  2.10    2.7
1991-12-27  3.50    4.5
1991-12-28  5.25    6.5

This is what I want to do with Dask, but I cannot.

In fact:

import dask.dataframe as dd

c = dd.from_pandas(b, npartitions=1)
c.map_partitions(simple_map, meta=c).compute()

produces the error shown above.

mat
    Often it is best to provide a [minimal reproducible example](https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) to help diagnose/troubleshoot problems. I would suggest confirming things work with Pandas. Is `y` a dataframe or a column ? If it's a column, then `meta` should reflect this. Personally, I find using `make_meta` super helpful here: `from dask.dataframe.utils import make_meta` – quasiben May 28 '20 at 12:57
  • Thank you, is it clearer now? `y` is a Series, I suppose – mat May 28 '20 at 13:55
  • For this question, you specified `meta=c`. The input `dask.dataframe` is also named `c`. In general, the value passed to `meta` does not need to be the same as the input `dask.dataframe`. So, `meta` could also have been a `dask.dataframe` with different column names and dtypes than `c`. You could have also passed in `meta={'Val1__seasonal_decompose': float, 'Val2__seasonal_decompose': float}` - by doing this, it would be more explicit that the output is a `dask.dataframe` with columns that have been passed through seasonal decomposition via a moving average - this may be easier to interpret. – edesz May 31 '20 at 16:17
  • Thank you, I appreciate it – mat Jun 01 '20 at 18:05

1 Answer


Thank you for the example!

From the docstring of `apply`:

Objects passed to the function are Series objects whose index is either the DataFrame's index (axis=0)

However, map_partitions works on an entire DataFrame partition, not column by column. I would suggest rewriting the function slightly:

from dask.dataframe.utils import make_meta

def simple_map_2(x):
    # Decompose each column explicitly and return the trends as a DataFrame
    xVal1 = seasonal_decompose(x.Val1, model='additive',
                               extrapolate_trend='freq', period=2,
                               two_sided=False)
    xVal2 = seasonal_decompose(x.Val2, model='additive',
                               extrapolate_trend='freq', period=2,
                               two_sided=False)
    return pd.DataFrame({'Val1': xVal1.trend, 'Val2': xVal2.trend})

c.map_partitions(simple_map_2, meta=make_meta(c)).compute()

            Val1  Val2
1991-12-25  0.70   0.9
1991-12-26  2.10   2.7
1991-12-27  3.50   4.5
1991-12-28  5.25   6.5
quasiben
  • 1,444
  • 1
  • 11
  • 19
  • In this way work but the problem is that I have so much more columns! Should I cycle? and when I cycle on columns do I lose all parallelization gain? – mat May 29 '20 at 07:27
  • Dask maintains partitions on rows not columns -- a partition is a chunk of a full df. You still have parallelization on the partition level of a df, not the individual columns. For large enough data, yes, I would assume you still get a parallelization gain. Still I would recommend benchmarking if concerned. If you can break down your UDF into basic operations: sum/min/max , then you could operate on columns individually. https://stackoverflow.com/questions/52117218/how-to-apply-a-function-to-multiple-columns-of-a-dask-data-frame-in-parallel/52118450 shows an example for how to do this. – quasiben May 29 '20 at 19:52