0

Problem

Is there some nice/efficient/best way how to get a dict of polars expressions being applied (and evaluated) on a dataframe (given a column value for match and same+other column values as a part of the expression evaluation)?

Setup

import polars as pl
pl.Config.set_fmt_str_lengths(100)

# base data
df = pl.DataFrame(
    {
        "a":[1,2,3], 
        "b":[2,3,4]
    }
)

# dict I want to use to map expression based on column `a`, and customize the message based on other columns from `df`
dct = {
    1: pl.format("My message about A '{}' and B '{}'", pl.col("a"), pl.col("b")), 
    2: pl.format("Another message having a={}'", pl.col("a"))
}

What I want to get to

exp_df = pl.DataFrame(
    {
        "a":[1,2,3], 
        "b":[2,3,4], 
        "message":["My message about a=1 and b=2", "Another message having a=2", "Default message having a=3, b=4"]
    }
)
print(exp_df)
shape: (3, 3)
┌─────┬─────┬─────────────────────────────────┐
│ a   ┆ b   ┆ message                         │
│ --- ┆ --- ┆ ---                             │
│ i64 ┆ i64 ┆ str                             │
╞═════╪═════╪═════════════════════════════════╡
│ 1   ┆ 2   ┆ My message about a=1 and b=2    │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 3   ┆ Another message having a=2      │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 4   ┆ Default message having a=3, b=4 │
└─────┴─────┴─────────────────────────────────┘

What I tried

df_achieved = df.with_columns(
    [
        pl.col("a").apply(
            lambda value: dct.get(
                value,
                pl.format("Default message having a={}, b={}", pl.col("a"), pl.col("b"))
            )
        ).alias("message")
    ]
)
print(df_achieved)
shape: (3, 3)
┌─────┬─────┬──────────────────────────────────────────────────────────────────────────────────────┐
│ a   ┆ b   ┆ message                                                                              │
│ --- ┆ --- ┆ ---                                                                                  │
│ i64 ┆ i64 ┆ object                                                                               │
╞═════╪═════╪══════════════════════════════════════════════════════════════════════════════════════╡
│ 1   ┆ 2   ┆ Utf8(My message about A ').concat_by([col("a"), Utf8(' and B '), col("b"), Utf8(')]) │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 3   ┆ Utf8(Another message having a=).concat_by([col("a"), Utf8(')])                       │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 4   ┆ Utf8(Default message having a=).concat_by([col("a"), Utf8(, b=), col("b")])          │
└─────┴─────┴──────────────────────────────────────────────────────────────────────────────────────┘

1 Answers1

1

As long as you don't have too many different messages, I suggest converting your dictionary to a when/then/otherwise expression programmatically. (Using apply can be painfully single-threaded and slow.)

For example, let's expand your example slightly:

# base data
df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})

# dict I want to use to map expression based on column `a`, and customize the message based on other columns from `df`
dct = {
    1: pl.format("My message about A '{}' and B '{}'", pl.col("a"), pl.col("b")),
    2: pl.format("Another message having a='{}'", pl.col("a")),
    3: pl.format("Still Another message having a='{}'", pl.col("a")),
}

We can convert the dictionary to a when/then/otherwise expression as follows:

(val_a, msg_expr), *remaining_msgs = dct.items()
when_expr = pl.when(pl.col('a') == val_a).then(msg_expr)
while remaining_msgs:
    (val_a, msg_expr), *remaining_msgs = remaining_msgs
    when_expr = when_expr.when(pl.col('a') == val_a).then(msg_expr)

when_expr = when_expr.otherwise(
    pl.format("Default message having a='{}', b='{}'", pl.col('a'), pl.col("b"))
)

The above code does not significantly slow our performance because the Python bytecode is merely generating the expression; it is not being applied to any data.

Then to run our when/then/otherwise expression, we simply use a with_column context.

(
    df
    .with_column(when_expr.alias('message'))
)
shape: (4, 3)
┌─────┬─────┬─────────────────────────────────────┐
│ a   ┆ b   ┆ message                             │
│ --- ┆ --- ┆ ---                                 │
│ i64 ┆ i64 ┆ str                                 │
╞═════╪═════╪═════════════════════════════════════╡
│ 1   ┆ 2   ┆ My message about A '1' and B '2'    │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 3   ┆ Another message having a='2'        │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 4   ┆ Still Another message having a='3'  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ 5   ┆ Default message having a='4', b='5' │
└─────┴─────┴─────────────────────────────────────┘

This approach should work as long as you don't have a large number of different messages. In that case, we may need to use partition_by to partition our DataFrame (by a), attach each message separately to each partition, and then concatenate the result at the end.