
I am trying to aggregate a Dask dataframe to a set of metrics, including the median, but it looks like median is not supported. Is there any way to aggregate and get the median?

st_agg = df.groupby(['start station id', 'end station id']).agg({'usertype':'count', 'tripduration':'median'})

>>> ValueError: unknown aggregate median
Philipp_Kats

1 Answer


As of October 6, 2021 there is not yet an implementation of this in Dask. There is an open Feature Request here.

Workaround for Specific Cases

From that same issue, the code below works for the specific case in which the data for each group fits on exactly one partition:

import dask
import dask.dataframe as dd

ddf = dask.datasets.timeseries()
ddf = ddf.set_index('id')

median_fun = dd.Aggregation(
    name="median",
    # this computes the median on each partition
    chunk=lambda s: s.median(),
    # this combines results across partitions; the input should just be a list of length 1
    agg=lambda s0: s0.sum(),
)

median_ddf = ddf.groupby("id")["x"].agg(median_fun)
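To see why the single-partition restriction matters, here is a small pure-pandas sketch (with made-up data) of the chunk/agg phases: when one group's values span two partitions, summing the per-partition medians does not give the group median:

```python
import pandas as pd

group = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # one group's values
part1, part2 = group[:2], group[2:]           # the group split across two partitions

true_median = group.median()                  # median of the whole group: 3.0
# the workaround's agg step sums the per-partition medians: 1.5 + 4.0 = 5.5
workaround = part1.median() + part2.median()
```

With the whole group on one partition, the "sum" is over a single value and is therefore exact; split across partitions, it is simply wrong.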

General Solution

For larger datasets, you could construct a custom aggregation function that calculates the median (or the 50th percentile) using `dd.Aggregation`. If you do this, consider submitting it as a PR to resolve the feature request listed above.

See docs here: https://docs.dask.org/en/stable/generated/dask.dataframe.groupby.Aggregation.html#dask-dataframe-groupby-aggregation

Median vs 50th Percentile

Note that for most practical purposes, the 50th percentile and the median are equivalent when working with large datasets: https://math.stackexchange.com/questions/2048470/is-50th-percentile-equal-to-median
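For example, with pandas' default linear interpolation the 50th percentile matches the median exactly (other `interpolation` modes, such as `'lower'`, can differ on even-length data):

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])
p50 = s.quantile(0.5)  # 50th percentile, linear interpolation by default
med = s.median()       # both are 3.0 here
```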

rrpelgrim