1

In Pandas, you can create an "ordered" Categorical column from an existing string column as follows:

column_values_with_custom_order = ["B", "A", "C"]
df["Column"] = pd.Categorical(df.Column, categories=column_values_with_custom_order, ordered=True)

I couldn't find a way to create ordered columns like this in the Polars documentation. However, I could reproduce the result by using pl.from_pandas(df), so I suspect this is possible in Polars as well.

What would be the recommended way to do this?

I tried to create a new column with polars_df.with_columns(pl.col("Column").cast(pl.Categorical)), but I don't know how to include the custom ordering in this.

I also checked "In polars, can I create a categorical type with levels myself?", but I would prefer not to add another column to my DataFrame only for ordering.

Eero H
  • Note that in the linked answer, another column is not added to the DataFrame … merely that a small Series is created with the desired ordering while the StringCache is in effect. The purpose of that initial Series is to set the order of the strings and nothing more. It can even be discarded, and never added to any DataFrame. Then, as long as the same StringCache remains in effect, any subsequent Categorical columns in any DataFrame will respect the order from that initial Series, even if the Series was discarded. – ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ Feb 09 '23 at 12:58
  • I notice now that the linked answer does, in fact, contain the solution to my question. Thanks for pointing that out. – Eero H Feb 09 '23 at 13:45

2 Answers

1

From the docs, use:

polars_df.with_columns(pl.col("Column").cast(pl.Categorical).cat.set_ordering("lexical"))

See the example in the docs:

df = pl.DataFrame(
    {"cats": ["z", "z", "k", "a", "b"], "vals": [3, 1, 2, 2, 3]}
).with_columns(
    [
        pl.col("cats").cast(pl.Categorical).cat.set_ordering("lexical"),
    ]
)
df.sort(["cats", "vals"])
0x26res
  • I believe this just sorts in alphabetical order? What if I wanted to define the order as ["k", "z", "b", "a"]? I don't think this would work then. Result from the code, with shape (5, 2) and columns cats (cat) and vals (i64): ("a", 2), ("b", 3), ("k", 2), ("z", 1), ("z", 3). – Eero H Feb 09 '23 at 11:46
  • I guess you can do `set_ordering("physical")` but the categories have to appear in the order you want them to be. – 0x26res Feb 09 '23 at 12:48
1

Say you have

df = pl.DataFrame(
     {"cats": ["z", "z", "k", "a", "b"], "vals": [3, 1, 2, 2, 3]}
     )

and you want to make cats a categorical but you want the categorical ordered as

myorder=["k", "z", "b", "a"]

There are two ways to do this. One is with pl.StringCache(), as in the question you reference; the other is messier. The former does not require you to add any columns to your df, and it's actually very succinct.

with pl.StringCache():
    pl.Series(myorder).cast(pl.Categorical)
    df=df.with_columns(pl.col('cats').cast(pl.Categorical))

What happens is that every string cast under the same StringCache shares one set of key values. Casting the myorder Series first records which key is allocated to each string, in the order the strings appear. When your df is then cast under the same cache, it reuses those key/string pairs, so the keys follow the order you wanted.
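To make the mechanism concrete, here is a minimal pure-Python sketch of the idea. It is only a toy model, not Polars internals: the ToyStringCache class and its encode method are hypothetical names invented for illustration, and the only assumption is that keys are handed out in first-seen order, as described above.

```python
# Toy model of a shared string cache (hypothetical; NOT how Polars implements it).
class ToyStringCache:
    def __init__(self):
        # Maps each string to an integer key, assigned in first-seen order.
        self._codes = {}

    def encode(self, values):
        """Return the integer key for each value, creating keys for unseen strings."""
        out = []
        for v in values:
            if v not in self._codes:
                self._codes[v] = len(self._codes)
            out.append(self._codes[v])
        return out


cache = ToyStringCache()

# "Prime" the cache with the desired order; the result itself is discarded.
myorder = ["k", "z", "b", "a"]
cache.encode(myorder)

# Encoding the real column now reuses the keys fixed by myorder.
cats = ["z", "z", "k", "a", "b"]
phys = cache.encode(cats)
print(phys)  # [1, 1, 0, 3, 2]

# Sorting by the integer keys reproduces the custom order.
sorted_cats = [c for _, c in sorted(zip(phys, cats))]
print(sorted_cats)  # ['k', 'z', 'z', 'b', 'a']
```

The keys printed for cats match the phys column in the verification output at the end of this answer, which is why the priming Series can be thrown away once the cast has happened.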

The other way to do this is as follows:

You have to sort your df by the desired ordering, then cast to Categorical and call set_ordering('physical'). If you want to keep your original row order, use with_row_count at the beginning so you can restore that order afterwards.

Putting it all together, it looks like this:

df=df.with_row_count('i').join(
        pl.from_dicts([{'order':x, 'cats':y} for x,y in enumerate(myorder)]), on='cats') \
    .sort('order').drop('order') \
    .with_columns(pl.col('cats').cast(pl.Categorical).cat.set_ordering('physical')) \
    .sort('i').drop('i')

You can verify by doing:

df.select(['cats',pl.col('cats').to_physical().alias('phys')])

shape: (5, 2)
┌──────┬──────┐
│ cats ┆ phys │
│ ---  ┆ ---  │
│ cat  ┆ u32  │
╞══════╪══════╡
│ z    ┆ 1    │
│ z    ┆ 1    │
│ k    ┆ 0    │
│ a    ┆ 3    │
│ b    ┆ 2    │
└──────┴──────┘
Dean MacGregor