
Can somebody help me with the preferred way to set a categorical value for some rows of a polars data frame (based on a condition)?

Right now I have a solution that works by splitting the original data frame into two parts (condition==True and condition==False). I set the categorical value on the first part and then concatenate the two parts again.

┌────────┬──────┐
│ column ┆ more │
│ ---    ┆ ---  │
│ cat    ┆ i32  │
╞════════╪══════╡
│ a      ┆ 1    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b      ┆ 5    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ e      ┆ 9    │ <- I want to set column to 'b' for all rows where it is 'e'
└────────┴──────┘

import polars as pl
df = pl.DataFrame(data={'column': ['a', 'b', 'e'], 'more': [1, 5, 9]}, columns=[('column', pl.Categorical), ('more', pl.Int32)])

print(df)

b_cat_value = df.filter(pl.col('column')=='b')['column'].unique()

df_e_replaced_with_b = df.filter(pl.col('column')=='e').with_column(b_cat_value.alias('column'))
df_no_e = df.filter(pl.col('column')!='e')

print(pl.concat([df_no_e, df_e_replaced_with_b]))

Output is as expected:

┌────────┬──────┐
│ column ┆ more │
│ ---    ┆ ---  │
│ cat    ┆ i32  │
╞════════╪══════╡
│ a      ┆ 1    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b      ┆ 5    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b      ┆ 9    │ <- column has been set to 'b'
└────────┴──────┘

Is there a more straightforward/canonical way to get b_cat_value, something similar to df['column'].dtype['b']?

And how would I use this in a conditional expression without splitting the data frame apart as in the above example? Something along the lines of...

df.with_column(
    pl.when(pl.col('column') == 'e').then(b_cat_value).otherwise(pl.col('column'))
)
datenzauber.ai
  • Could you explain what your goal is? What would the result look like? – ritchie46 May 11 '22 at 08:00
  • Sure, thanks for the help. I tried to make the question clearer. My original intent is to duplicate the rows where column=='e' and replace them with one version where it is 'a' and one where it is 'b', but that seems straightforward to me once I know how to set the categorical value for a subset of rows in a canonical way. So I left the duplication part out. – datenzauber.ai May 11 '22 at 10:51
  • This PR https://github.com/pola-rs/polars/pull/3370 ensures that we maintain the categorical type in a `when -> then -> otherwise` – ritchie46 May 11 '22 at 18:28

1 Answer

As of polars 0.13.33, you can simply set a categorical value with a string, and the Categorical dtype will be maintained.

So in this case:

df.with_column(
    pl.when(pl.col("column") == "e").then("b").otherwise(pl.col("column"))
)
ritchie46