How to map a dict of expressions to a dataframe

Question

Problem

Is there some nice/efficient/best way how to get a dict of polars expressions being applied (and evaluated) on a dataframe (given a column value for match and same+other column values as a part of the expression evaluation)?

Setup

import polars as pl
pl.Config.set_fmt_str_lengths(100)

# base data
df = pl.DataFrame(
    {
        "a":[1,2,3], 
        "b":[2,3,4]
    }
)

# dict I want to use to map expression based on column `a`, and customize the message based on other columns from `df`
dct = {
    1: pl.format("My message about A '{}' and B '{}'", pl.col("a"), pl.col("b")), 
    2: pl.format("Another message having a={}'", pl.col("a"))
}

What I want to get to

exp_df = pl.DataFrame(
    {
        "a":[1,2,3], 
        "b":[2,3,4], 
        "message":["My message about a=1 and b=2", "Another message having a=2", "Default message having a=3, b=4"]
    }
)

print(exp_df)
shape: (3, 3)
┌─────┬─────┬─────────────────────────────────┐
│ a   ┆ b   ┆ message                         │
│ --- ┆ --- ┆ ---                             │
│ i64 ┆ i64 ┆ str                             │
╞═════╪═════╪═════════════════════════════════╡
│ 1   ┆ 2   ┆ My message about a=1 and b=2    │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 3   ┆ Another message having a=2      │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 4   ┆ Default message having a=3, b=4 │
└─────┴─────┴─────────────────────────────────┘

What I tried

df_achieved = df.with_columns(
    [
        pl.col("a").apply(
            lambda value: dct.get(
                value,
                pl.format("Default message having a={}, b={}", pl.col("a"), pl.col("b"))
            )
        ).alias("message")
    ]
)

print(df_achieved)
shape: (3, 3)
┌─────┬─────┬──────────────────────────────────────────────────────────────────────────────────────┐
│ a   ┆ b   ┆ message                                                                              │
│ --- ┆ --- ┆ ---                                                                                  │
│ i64 ┆ i64 ┆ object                                                                               │
╞═════╪═════╪══════════════════════════════════════════════════════════════════════════════════════╡
│ 1   ┆ 2   ┆ Utf8(My message about A ').concat_by([col("a"), Utf8(' and B '), col("b"), Utf8(')]) │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 3   ┆ Utf8(Another message having a=).concat_by([col("a"), Utf8(')])                       │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 4   ┆ Utf8(Default message having a=).concat_by([col("a"), Utf8(, b=), col("b")])          │
└─────┴─────┴──────────────────────────────────────────────────────────────────────────────────────┘

score 1 · Accepted Answer · 2022-09-13T17:42:10.867

As long as you don't have too many different messages, I suggest converting your dictionary to a when/then/otherwise expression programmatically. (Using apply can be painfully single-threaded and slow.)

For example, let's expand your example slightly:

# base data
df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})

# dict I want to use to map expression based on column `a`, and customize the message based on other columns from `df`
dct = {
    1: pl.format("My message about A '{}' and B '{}'", pl.col("a"), pl.col("b")),
    2: pl.format("Another message having a='{}'", pl.col("a")),
    3: pl.format("Still Another message having a='{}'", pl.col("a")),
}

We can convert the dictionary to a when/then/otherwise expression as follows:

(val_a, msg_expr), *remaining_msgs = dct.items()
when_expr = pl.when(pl.col('a') == val_a).then(msg_expr)
while remaining_msgs:
    (val_a, msg_expr), *remaining_msgs = remaining_msgs
    when_expr = when_expr.when(pl.col('a') == val_a).then(msg_expr)

when_expr = when_expr.otherwise(
    pl.format("Default message having a='{}', b='{}'", pl.col('a'), pl.col("b"))
)

The above code does not significantly slow our performance because the Python bytecode is merely generating the expression; it is not being applied to any data.

Then to run our when/then/otherwise expression, we simply use a with_column context.

(
    df
    .with_column(when_expr.alias('message'))
)

shape: (4, 3)
┌─────┬─────┬─────────────────────────────────────┐
│ a   ┆ b   ┆ message                             │
│ --- ┆ --- ┆ ---                                 │
│ i64 ┆ i64 ┆ str                                 │
╞═════╪═════╪═════════════════════════════════════╡
│ 1   ┆ 2   ┆ My message about A '1' and B '2'    │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 3   ┆ Another message having a='2'        │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 4   ┆ Still Another message having a='3'  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ 5   ┆ Default message having a='4', b='5' │
└─────┴─────┴─────────────────────────────────────┘

This approach should work as long as you don't have a large number of different messages. In that case, we may need to use partition_by to partition our DataFrame (by a), attach each message separately to each partition, and then concatenate the result at the end.

That looks like an optimal solution to my use case. Thank you! — Michal Chromčák, Sep 14 '22 at 18:19

How to map a dict of expressions to a dataframe

1 Answers1

Linked