
I am trying to aggregate a Dask dataframe to a set of metrics, including the median, but it looks like median is not supported. Is there any way to aggregate and get the median?

st_agg = df.groupby(['start station id', 'end station id']).agg({'usertype':'count', 'tripduration':'median'})

>>> ValueError: unknown aggregate median
Philipp_Kats

1 Answer


As of October 6, 2021 there is not yet an implementation of this in Dask. There is an open Feature Request here.

Workaround for Specific Cases

From that same issue, the code below works for the specific case in which the data for each group fits on exactly one partition:

import dask
import dask.dataframe as dd

ddf = dask.datasets.timeseries()
ddf = ddf.set_index('id')

median_fun = dd.Aggregation(
    name="median",
    # this computes the median on each partition
    chunk=lambda s: s.median(),
    # this combines results across partitions; the input should just be a list of length 1
    agg=lambda s0: s0.sum(),
)

median_ddf = ddf.groupby("id")["x"].agg(median_fun)
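To see why the single-partition restriction matters, here is a small pure-pandas sketch (with made-up data) of the chunk/agg phases: when one group's values span two partitions, summing the per-partition medians does not give the group median:

```python
import pandas as pd

group = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # one group's values
part1, part2 = group[:2], group[2:]           # the group split across two partitions

true_median = group.median()                  # median of the whole group: 3.0
# the workaround's agg step sums the per-partition medians: 1.5 + 4.0 = 5.5
workaround = part1.median() + part2.median()
```

With the whole group on one partition, the "sum" is over a single value and is therefore exact; split across partitions, it is simply wrong.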

General Solution

For larger datasets, you could construct a custom aggregation function that calculates the median (or the 50th percentile) using `dd.Aggregation`. If you do this, consider submitting it as a PR to resolve the feature request listed above.

See docs here: https://docs.dask.org/en/stable/generated/dask.dataframe.groupby.Aggregation.html#dask-dataframe-groupby-aggregation

Median vs 50th Percentile

Note that for most practical purposes, the 50th percentile and the median are equivalent when working with large datasets: https://math.stackexchange.com/questions/2048470/is-50th-percentile-equal-to-median
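For example, with pandas' default linear interpolation the 50th percentile matches the median exactly (other `interpolation` modes, such as `'lower'`, can differ on even-length data):

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])
p50 = s.quantile(0.5)  # 50th percentile, linear interpolation by default
med = s.median()       # both are 3.0 here
```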

rrpelgrim