
I have a multi-index Dask dataframe on which I need to perform a groupby followed by a diff. This operation is trivial in pure pandas with the following command:

df.groupby('IndexName')['ValueName'].diff()
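
To make the target behaviour concrete, here is a minimal pandas-only sketch. The fixed values and the name `frames` are made up for illustration (they are not the real data); only the 'IndexName' level and the 'ValueName' column follow the layout described in this question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: an outer index level named 'IndexName'
# and a 'ValueName' column, with fixed values so the diffs are easy to check.
frames = {
    'Index1': pd.DataFrame({'A': np.arange(4), 'ValueName': [1.0, 3.0, 6.0, 10.0]}),
    'Index2': pd.DataFrame({'A': np.arange(3), 'ValueName': [5.0, 4.0, 2.0]}),
}
pdf = pd.concat(frames, names=['IndexName'])

# groupby accepts an index level name, so each outer key forms its own group.
# diff() is computed within each group, so the first row of every group is NaN.
result = pdf.groupby('IndexName')['ValueName'].diff()
print(result)
```

The result keeps the original MultiIndex, and the differences never cross a group boundary, which is the property the Dask version needs to preserve.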

Dask, however, doesn't implement the diff function on SeriesGroupBy objects. I've attempted to implement my own with the following command:

df.groupby('IndexName')['ValueName'].apply(lambda x: x.diff(1) )

but this yields the following error:

ValueError: Wrong number of items passed 0, placement implies 3987

Any ideas?

Below is the sample dataframe:

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask import delayed

# Two sample frames keyed by the outer index level 'IndexName'
dummy = {
    'Index1': pd.DataFrame({'A': np.arange(10), 'ValueName': np.random.rand(10)}),
    'Index2': pd.DataFrame({'A': np.arange(5), 'ValueName': np.random.rand(5)}),
}
pdf = pd.concat(dummy, names=['IndexName'])

def getDummy(f):
    return f

dfs = [delayed(getDummy)(f) for f in [pdf]]
# NOTE: dd.from_pandas doesn't support multiindex...but delayed does
df = dd.from_delayed(dfs)
IAS_LLC
