
I'm trying to port parts of my application from pandas to dask, and I hit a roadblock when using a lambda function in a groupby on a dask DataFrame.

import dask.dataframe as dd

dask_df = dd.from_pandas(pandasDataFrame, npartitions=2)
dask_df = dask_df.groupby(
                        ['one', 'two', 'three', 'four'],
                        sort=False
                    ).agg({'AGE' : lambda x: x * x })

This code fails with the following error:

ValueError: unknown aggregate lambda

The lambda in my application is more complex than this one, but its content doesn't matter; the error is always the same. There is a very similar example in the documentation, so this should work, and I'm not sure what I'm missing.

The same groupby works in pandas, but I need to improve its performance.

I'm using dask 0.12.0 with python 3.5.

barney.balazs

1 Answer


From the Dask docs:

"Dask supports Pandas’ aggregate syntax to run multiple reductions on the same groups. Common reductions such as max, sum, list and mean are directly supported.

Dask also supports user defined reductions. To ensure proper performance, the reduction has to be formulated in terms of three independent steps. The chunk step is applied to each partition independently and reduces the data within a partition. The aggregate combines the within partition results. The optional finalize step combines the results returned from the aggregate step and should return a single final column. For Dask to recognize the reduction, it has to be passed as an instance of dask.dataframe.Aggregation.

For example, sum could be implemented as:

custom_sum = dd.Aggregation('custom_sum', lambda s: s.sum(), lambda s0: s0.sum())
df.groupby('g').agg(custom_sum)

"

rrpelgrim