Polars - Replace part of string in column with value of other column

Question

So I have a Polars dataframe looking as such

df = pl.DataFrame(
    {
        "ItemId": [15148, 15148, 24957],
        "SuffixFactor": [19200, 200, 24],
        "ItemRand": [254, -1, -44],
        "Stat0": ['+5 Defense', '+$i Might', '+9 Vitality'],
        "Amount": ['', '7', '']
    }
)

I want to replace $i in the column "Stat0" with Amount whenever Stat0 contains i$

I have tried a couple different things such as:

df = df.with_column(
    pl.col('Stat0').str.replace(r'\$i', pl.col('Amount'))
)

Expected result

result = pl.DataFrame(
    {
        "ItemId": [15148, 15148, 24957],
        "SuffixFactor": [19200, 200, 24],
        "ItemRand": [254, -1, -44],
        "Stat0": ['+5 Defense', '+7 Might', '+9 Vitality'],
        "Amount": ['', '7', '']
    }
)

But this doesn't seem to work.

I hope someone can help.

Best regards

can u pls add what is your expected output as `df` ? – Bhargav - Retarded Skills Aug 20 '22 at 13:36 — Bhargav - Retarded Skills, Aug 20 '22 at 13:36
Oh sorry, yes I have made an edit now :) – Shamatix Aug 20 '22 at 13:57 — Shamatix, Aug 20 '22 at 13:57

score 9 · Accepted Answer · 2022-08-22T18:08:20.580

Edit: Polars >= 0.14.4

As of Polars 0.14.4, the replace and replace_all expressions allow an Expression for the value parameter. Thus, we can solve this more simply as:

df.with_column(
    pl.col('Stat0').str.replace(r'\$i', pl.col('Amount'))
)

shape: (3, 5)
┌────────┬──────────────┬──────────┬─────────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Stat0       ┆ Amount │
│ ---    ┆ ---          ┆ ---      ┆ ---         ┆ ---    │
│ i64    ┆ i64          ┆ i64      ┆ str         ┆ str    │
╞════════╪══════════════╪══════════╪═════════════╪════════╡
│ 15148  ┆ 19200        ┆ 254      ┆ +5 Defense  ┆        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ +7 Might    ┆ 7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆ +9 Vitality ┆        │
└────────┴──────────────┴──────────┴─────────────┴────────┘

Polars < 0.14.4

The problem is that the replace method does not take an Expression, only a constant. Thus, we cannot use a column as replacement values, only a constant.

We can get around this in two ways.

Slow: using `apply`

This method uses python code to perform the replacement. Since we are executing python bytecode using apply, it will be slow. If your DataFrame is small, then this won't be too painfully slow.

(
    df
    .with_column(
        pl.struct(['Stat0', 'Amount'])
        .apply(lambda cols: cols['Stat0'].replace('$i', cols['Amount']))
        .alias('Stat0')
    )
)

shape: (3, 5)
┌────────┬──────────────┬──────────┬─────────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Stat0       ┆ Amount │
│ ---    ┆ ---          ┆ ---      ┆ ---         ┆ ---    │
│ i64    ┆ i64          ┆ i64      ┆ str         ┆ str    │
╞════════╪══════════════╪══════════╪═════════════╪════════╡
│ 15148  ┆ 19200        ┆ 254      ┆ +5 Defense  ┆        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ +7 Might    ┆ 7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆ +9 Vitality ┆        │
└────────┴──────────────┴──────────┴─────────────┴────────┘

Fast: using `split_exact` and `when/then/otherwise`

This method uses all Polars Expressions. As such, it will be much faster, especially for large DataFrames.

(
    df
    .with_column(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
    .unnest('Stat0')
    .with_column(
        pl.when(pl.col('field_1').is_null())
        .then(pl.col('field_0'))
        .otherwise(pl.concat_str(['field_0', 'Amount', 'field_1']))
        .alias('Stat0')
    )
    .drop(['field_0', 'field_1'])
)

shape: (3, 5)
┌────────┬──────────────┬──────────┬────────┬─────────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Amount ┆ Stat0       │
│ ---    ┆ ---          ┆ ---      ┆ ---    ┆ ---         │
│ i64    ┆ i64          ┆ i64      ┆ str    ┆ str         │
╞════════╪══════════════╪══════════╪════════╪═════════════╡
│ 15148  ┆ 19200        ┆ 254      ┆        ┆ +5 Defense  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ 7      ┆ +7 Might    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆        ┆ +9 Vitality │
└────────┴──────────────┴──────────┴────────┴─────────────┘

How it works: we first split the Stat0 column on $i using split_exact. This will produce a struct.

(
    df
    .with_column(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
)

shape: (3, 5)
┌────────┬──────────────┬──────────┬──────────────────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Stat0                ┆ Amount │
│ ---    ┆ ---          ┆ ---      ┆ ---                  ┆ ---    │
│ i64    ┆ i64          ┆ i64      ┆ struct[2]            ┆ str    │
╞════════╪══════════════╪══════════╪══════════════════════╪════════╡
│ 15148  ┆ 19200        ┆ 254      ┆ {"+5 Defense",null}  ┆        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ {"+"," Might"}       ┆ 7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆ {"+9 Vitality",null} ┆        │
└────────┴──────────────┴──────────┴──────────────────────┴────────┘

Notice that when Stat0 does not contain $i, the second member of the struct is null. We'll use this fact to our advantage.

In the next step, we break the struct into separate columns, using unnest.

(
    df
    .with_column(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
    .unnest('Stat0')
)

shape: (3, 6)
┌────────┬──────────────┬──────────┬─────────────┬─────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ field_0     ┆ field_1 ┆ Amount │
│ ---    ┆ ---          ┆ ---      ┆ ---         ┆ ---     ┆ ---    │
│ i64    ┆ i64          ┆ i64      ┆ str         ┆ str     ┆ str    │
╞════════╪══════════════╪══════════╪═════════════╪═════════╪════════╡
│ 15148  ┆ 19200        ┆ 254      ┆ +5 Defense  ┆ null    ┆        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ +           ┆  Might  ┆ 7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆ +9 Vitality ┆ null    ┆        │
└────────┴──────────────┴──────────┴─────────────┴─────────┴────────┘

This creates two new columns: field_0 and field_1.

From here, we use when/then/otherwise and concat_str to construct the final result

Basically:

when $i does not appear in the Stat0 column, then the string is not split, and field_1 is null, so we can use the value in field_0 as is.
when $i does appear in Stat0, then the string is split into two parts: field_0 and field_1. We simply concatenate the parts back together, putting Amount in the middle.

(
    df
    .with_column(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
    .unnest('Stat0')
    .with_column(
        pl.when(pl.col('field_1').is_null())
        .then(pl.col('field_0'))
        .otherwise(pl.concat_str(['field_0', 'Amount', 'field_1']))
        .alias('Stat0')
    )
)

shape: (3, 7)
┌────────┬──────────────┬──────────┬─────────────┬─────────┬────────┬─────────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ field_0     ┆ field_1 ┆ Amount ┆ Stat0       │
│ ---    ┆ ---          ┆ ---      ┆ ---         ┆ ---     ┆ ---    ┆ ---         │
│ i64    ┆ i64          ┆ i64      ┆ str         ┆ str     ┆ str    ┆ str         │
╞════════╪══════════════╪══════════╪═════════════╪═════════╪════════╪═════════════╡
│ 15148  ┆ 19200        ┆ 254      ┆ +5 Defense  ┆ null    ┆        ┆ +5 Defense  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ +           ┆  Might  ┆ 7      ┆ +7 Might    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆ +9 Vitality ┆ null    ┆        ┆ +9 Vitality │
└────────┴──────────────┴──────────┴─────────────┴─────────┴────────┴─────────────┘

Damn!!...Lot work!!...You deserve an upvote – Bhargav - Retarded Skills Aug 20 '22 at 14:20 — Bhargav - Retarded Skills, Aug 20 '22 at 14:20
`df.with_columns(...)` apparently. – JL Peyret May 24 '23 at 23:56 — JL Peyret, May 24 '23 at 23:56

Polars - Replace part of string in column with value of other column

1 Answers1

Edit: Polars >= 0.14.4

Polars < 0.14.4

Slow: using apply

Fast: using split_exact and when/then/otherwise

Slow: using `apply`

Fast: using `split_exact` and `when/then/otherwise`