Suppose I have the following dataframe:

import polars as pl
df = pl.DataFrame({'x': [[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]]})

To get the nth percentile, I can do the following:

list_quantile_30 = pl.element().quantile(0.3)
df.select(pl.col('x').arr.eval(list_quantile_30))

But I can't figure out how to get the index corresponding to that percentile. Here is how I would do it using NumPy:

import numpy as np
series = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
np.searchsorted(series, np.percentile(series, 30))

Is there a way to do this natively in Polars, without using apply?

Scout
  • If your data is sorted, isn't identifying the position of the element at the 30% quantile the same thing as taking the length of the array and multiplying by 0.30, then rounding? – Nick ODell Aug 08 '22 at 15:44
  • Yes, that's correct for the example, but in my own use case I also have NaN values, so that wouldn't work (I am basically using the index as a column identifier). – Scout Aug 08 '22 at 16:59

1 Answer

Continuing from your example, you could use pl.arg_where to find the first index at which an element is greater than or equal to the quantile.

df = pl.DataFrame({'x':[[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]]})

list_quantile_30 = pl.element().quantile(0.3)

df.with_column(pl.col('x').arr.eval(
    pl.arg_where(list_quantile_30 <= pl.element()).first()
).flatten().alias("arg_where"))
shape: (1, 2)
┌────────────────┬───────────┐
│ x              ┆ arg_where │
│ ---            ┆ ---       │
│ list[i64]      ┆ u32       │
╞════════════════╪═══════════╡
│ [0, 2, ... 20] ┆ 3         │
└────────────────┴───────────┘

This convinces me to add a pl.search_sorted to Polars as well.
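
As a cross-check, the same "first index where the element is at least the quantile" logic can be sketched in plain NumPy. This is only an illustration, not part of Polars: the helper name first_index_at_quantile is hypothetical, and it uses np.nanquantile on the assumption that NaN values (mentioned in the comments on the question) should simply be ignored when computing the quantile.

```python
import numpy as np

def first_index_at_quantile(values, q):
    """Return the first index whose value is >= the q-quantile (NaNs ignored)."""
    arr = np.asarray(values, dtype=float)
    # Assumption: NaNs are skipped when computing the quantile.
    threshold = np.nanquantile(arr, q)
    # NaN >= threshold evaluates to False, so NaN positions are never selected.
    hits = np.flatnonzero(arr >= threshold)
    return int(hits[0]) if hits.size else None

series = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
idx = first_index_at_quantile(series, 0.3)

# For sorted input without NaNs this should agree with the searchsorted approach.
agrees = idx == np.searchsorted(series, np.percentile(series, 30))
```

Unlike np.searchsorted, this sketch does not require the input to be sorted, at the cost of scanning the whole array.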

ritchie46