I would like to clip numerical values in a DataFrame based on the result of an expression evaluated on that DataFrame. However, the clip function only accepts floats or ints, not expressions.

Given the following:

import polars as pl
df = pl.DataFrame({'x': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

How would I best clip all values to between the 20th and 80th percentile? I tried the built-in clip function first:

df.with_column(
    pl.col("x").clip(
        min_val=pl.col("x").quantile(0.20),
        max_val=pl.col("x").quantile(0.80),
    )
    .alias("clipped")
)

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'RuntimeError'>, value: RuntimeError('BindingsError: "row type not supported <polars.internals.expr.expr.Expr object at 0x0000016F4B3053C0>"'), traceback: None }', src\lazy\dsl.rs:351:53
Traceback (most recent call last):
  File "C:\Users\BWT\Anaconda3\envs\tca_ml\lib\site-packages\IPython\core\interactiveshell.py", line 3398, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-19-240f333898af>", line 2, in <cell line: 1>
    pl.col("x").clip(
  File "C:\Users\BWT\Anaconda3\envs\tca_ml\lib\site-packages\polars\internals\expr\expr.py", line 4840, in clip
    return wrap_expr(self._pyexpr.clip(min_val, max_val))
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'RuntimeError'>, value: RuntimeError('BindingsError: "row type not supported <polars.internals.expr.expr.Expr object at 0x0000016F4B3053C0>"'), traceback: None }

The following works and yields the expected results, but is rather ugly in my opinion:

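# The quantile bounds as expressions, evaluated against the column itself
lower = pl.col("x").quantile(0.20)
upper = pl.col("x").quantile(0.80)
df.with_columns(
    [
        # Replace values below/above the bounds; keep everything else unchanged
        pl.when(pl.col("x") < lower)
        .then(lower)
        .when(pl.col("x") > upper)
        .then(upper)
        .otherwise(pl.col("x"))
        .alias("clipped")
    ]
)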

Out[31]: 
shape: (11, 2)
┌─────┬─────────┐
│ x   ┆ clipped │
│ --- ┆ ---     │
│ i64 ┆ f64     │
╞═════╪═════════╡
│ 0   ┆ 2.0     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ 2.0     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2.0     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 3.0     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ...     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 7   ┆ 7.0     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 8   ┆ 8.0     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 9   ┆ 8.0     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10  ┆ 8.0     │
└─────┴─────────┘

What would be the best way to do this without making it overly verbose?

  • 1
    Passing in an expression is indeed not supported. I noticed there was an old request for this, so I re-opened it and added a link to this question: https://github.com/pola-rs/polars/issues/2990. A workaround is to compute the quantiles using the eager API: ``` lower = df["x"].quantile(0.2); upper = df["x"].quantile(0.8); df.with_column(pl.col("x").clip(lower, upper).alias("clipped")) ``` – jvz Aug 29 '22 at 14:21
