I have a Dask dataframe with a multi-index, on which I need to perform a groupby followed by a diff. This operation is trivial in pure pandas:
df.groupby('IndexName')['ValueName'].diff()
Dask, however, doesn't implement the diff function on SeriesGroupBy objects. I've attempted to implement my own with the following command:
df.groupby('IndexName')['ValueName'].apply(lambda x: x.diff(1))
but this yields the following error:
ValueError: Wrong number of items passed 0, placement implies 3987
Any ideas?
Below is a sample dataframe:
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask import delayed

# Two groups of different lengths, keyed by the outer index level 'IndexName'
dummy = {
    'Index1': pd.DataFrame({'A': np.arange(10), 'ValueName': np.random.rand(10)}),
    'Index2': pd.DataFrame({'A': np.arange(5), 'ValueName': np.random.rand(5)}),
}
pdf = pd.concat(dummy, names=['IndexName'])

def getDummy(f):
    return f

dfs = [delayed(getDummy)(f) for f in [pdf]]
# NOTE: dd.from_pandas doesn't support multiindex...but delayed does
df = dd.from_delayed(dfs)
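For reference, here is a minimal pandas-only sketch of the result I'm trying to reproduce in Dask, built on the same sample frame. The fixed seed and the names `rng`, `expected` and `manual` are my own additions for illustration; the group-by-group loop is only there to show what the grouped diff computes, not a proposed Dask solution.

```python
import numpy as np
import pandas as pd

# Fixed seed: an assumption added here so the example is repeatable
rng = np.random.RandomState(0)

dummy = {
    'Index1': pd.DataFrame({'A': np.arange(10), 'ValueName': rng.rand(10)}),
    'Index2': pd.DataFrame({'A': np.arange(5), 'ValueName': rng.rand(5)}),
}
pdf = pd.concat(dummy, names=['IndexName'])

# Built-in grouped diff: the first row of each group has no predecessor,
# so it comes back as NaN
expected = pdf.groupby('IndexName')['ValueName'].diff()

# The same computation done by hand, one group at a time
manual = pd.concat(
    [group['ValueName'].diff() for _, group in pdf.groupby('IndexName')]
)
```

The output keeps the original multi-index and has the same length as the input, with one NaN at the start of each group. That is the shape I would expect the Dask version to return as well.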