3

How can I use pl.cut in a select context, such as df.with_columns?

To be more specific, if I have a polars dataframe with a lot of columns and one of them is called x, how can I do pl.cut on x and append the grouping result into the original dataframe?

Below is what I tried but it does not work:

df = pl.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 3, 4, 5, 6], "x": [1, 3, 5, 7, 9]})
df.with_columns(pl.cut(pl.col("x"), bins=[2, 4, 6]))

Thanks so much for your help.

Dean MacGregor
lebesgue

3 Answers

4

From the docs, as of 2023-01-25, cut takes a Series and returns a DataFrame. Unlike most methods and functions, it doesn't take an expression, so you can't use it in a select or with_column(s). To get your desired result, you have to join its output back to your original df.

Additionally, it appears that cut doesn't necessarily keep the dtype of the parent Series (this is most certainly a bug). As such, you have to cast the x column of its output back to, in this case, Int64 before joining.

You'd have:

df=df.join(
    pl.cut(df.get_column('x'),bins=[2,4,6]).with_column(pl.col('x').cast(pl.Int64())),
    on='x'
)

shape: (5, 5)
┌─────┬─────┬─────┬─────────────┬─────────────┐
│ a   ┆ b   ┆ x   ┆ break_point ┆ category    │
│ --- ┆ --- ┆ --- ┆ ---         ┆ ---         │
│ i64 ┆ i64 ┆ i64 ┆ f64         ┆ cat         │
╞═════╪═════╪═════╪═════════════╪═════════════╡
│ 1   ┆ 2   ┆ 1   ┆ 2.0         ┆ (-inf, 2.0] │
│ 2   ┆ 3   ┆ 3   ┆ 4.0         ┆ (2.0, 4.0]  │
│ 3   ┆ 4   ┆ 5   ┆ 6.0         ┆ (4.0, 6.0]  │
│ 4   ┆ 5   ┆ 7   ┆ inf         ┆ (6.0, inf]  │
│ 5   ┆ 6   ┆ 9   ┆ inf         ┆ (6.0, inf]  │
└─────┴─────┴─────┴─────────────┴─────────────┘
Dean MacGregor
  • It appears that [`pl.cut()` does not always maintain original order](https://github.com/pola-rs/polars/issues/4286) which is another issue. – jqurious Jan 25 '23 at 15:59
  • @jqurious if we're just joining it back to the original, why does the order matter, perhaps I'm misunderstanding what you mean though. – Dean MacGregor Jan 25 '23 at 17:51
  • No - it's my mistake - it isn't relevant if you're joining as you say, apologies. – jqurious Jan 25 '23 at 18:15
2
df = pl.DataFrame(
    {"a": [1, 2, 3, 4, 5],
     "b": [2, 3, 4, 5, 6],
     "x": [1, 3, 5, 7, 9]}
)

df.with_columns(
    pl.col('x').cut([2, 4, 6]).alias('x_cut')
)
shape: (5, 4)
┌─────┬─────┬─────┬───────────┐
│ a   ┆ b   ┆ x   ┆ x_cut     │
│ --- ┆ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ i64 ┆ cat       │
╞═════╪═════╪═════╪═══════════╡
│ 1   ┆ 2   ┆ 1   ┆ (-inf, 2] │
│ 2   ┆ 3   ┆ 3   ┆ (2, 4]    │
│ 3   ┆ 4   ┆ 5   ┆ (4, 6]    │
│ 4   ┆ 5   ┆ 7   ┆ (6, inf]  │
│ 5   ┆ 6   ┆ 9   ┆ (6, inf]  │
└─────┴─────┴─────┴───────────┘

Old solution

As of 0.16.8, the top-level function pl.cut has been deprecated. You now have to use the Series method .cut instead, which returns a three-column DataFrame.

# get x column as a Series and then apply .cut method
df['x'].cut(bins=[2, 4, 6])

It returns a DataFrame like the following:

shape: (5, 3)
┌─────┬─────────────┬─────────────┐
│ x   ┆ break_point ┆ category    │
│ --- ┆ ---         ┆ ---         │
│ f64 ┆ f64         ┆ cat         │
╞═════╪═════════════╪═════════════╡
│ 1.0 ┆ 2.0         ┆ (-inf, 2.0] │
│ 3.0 ┆ 4.0         ┆ (2.0, 4.0]  │
│ 5.0 ┆ 6.0         ┆ (4.0, 6.0]  │
│ 7.0 ┆ inf         ┆ (6.0, inf]  │
│ 9.0 ┆ inf         ┆ (6.0, inf]  │
└─────┴─────────────┴─────────────┘

If you just want to add the cut categories to your main DataFrame, you can do so directly in with_columns():

df.with_columns(
    df['x'].cut(bins=[2, 4, 6], maintain_order=True)['category'].alias('x_cut')
)

# or
df.with_columns(
    x_cut=df['x'].cut(bins=[2, 4, 6], maintain_order=True)['category']
)
shape: (5, 4)
┌─────┬─────┬─────┬─────────────┐
│ a   ┆ b   ┆ x   ┆ x_cut       │
│ --- ┆ --- ┆ --- ┆ ---         │
│ i64 ┆ i64 ┆ i64 ┆ cat         │
╞═════╪═════╪═════╪═════════════╡
│ 1   ┆ 2   ┆ 1   ┆ (-inf, 2.0] │
│ 2   ┆ 3   ┆ 3   ┆ (2.0, 4.0]  │
│ 3   ┆ 4   ┆ 5   ┆ (4.0, 6.0]  │
│ 4   ┆ 5   ┆ 7   ┆ (6.0, inf]  │
│ 5   ┆ 6   ┆ 9   ┆ (6.0, inf]  │
└─────┴─────┴─────┴─────────────┘
steven
  • As a side question, how can I do `over` if using `cut` the way you specified? – lebesgue Apr 04 '23 at 18:34
  • @lebesgue I'm not sure if you can combine them at this moment since `.over` is an expr, which has been used in a context like `with_columns`. For now, `.cut` is a series method, so... – steven Apr 04 '23 at 21:03
1

As of 0.18.5 you can use cut as an expression. (Due to a lack of reputation, I unfortunately can't post this as a comment on the previous answers.)

import polars as pl
df = pl.DataFrame({"numbers": range(0, 20, 2)})
(
    df.with_columns(
        pl.col("numbers").cut([4, 7, 15]).alias("bins")
    )
)
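
The snippet above is posted without its output. As a rough guide to which bin each value should land in, here is a minimal pure-Python sketch of right-closed binning, where a value equal to a break falls in the interval that ends at that break (the same convention as the `(2, 4]` labels shown in the other answers). The helper `right_closed_label` is a hypothetical name used only for illustration; it is not part of polars and says nothing about how polars implements cut internally.

```python
from bisect import bisect_left


def right_closed_label(value, breaks):
    """Label of the right-closed interval containing value.

    Hypothetical helper for illustration only; breaks must be sorted ascending.
    """
    edges = ["-inf", *map(str, breaks), "inf"]
    # Index of the first break that is >= value, so a value equal to a
    # break is assigned to the interval that ends at that break.
    i = bisect_left(breaks, value)
    return f"({edges[i]}, {edges[i + 1]}]"


numbers = list(range(0, 20, 2))
bins = [right_closed_label(n, [4, 7, 15]) for n in numbers]
for n, b in zip(numbers, bins):
    print(n, b)
```

Under that assumption, 0, 2 and 4 fall in (-inf, 4], 6 in (4, 7], 8 through 14 in (7, 15], and 16 and 18 in (15, inf]. The exact label formatting polars prints may differ between versions.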
Mondo30003