6

So I have a Polars dataframe looking as such

df = pl.DataFrame(
    {
        "ItemId": [15148, 15148, 24957],
        "SuffixFactor": [19200, 200, 24],
        "ItemRand": [254, -1, -44],
        "Stat0": ['+5 Defense', '+$i Might', '+9 Vitality'],
        "Amount": ['', '7', '']
    }
)

I want to replace $i in the column "Stat0" with Amount whenever Stat0 contains i$

I have tried a couple different things such as:

df = df.with_column(
    pl.col('Stat0').str.replace(r'\$i', pl.col('Amount'))
)

Expected result

result = pl.DataFrame(
    {
        "ItemId": [15148, 15148, 24957],
        "SuffixFactor": [19200, 200, 24],
        "ItemRand": [254, -1, -44],
        "Stat0": ['+5 Defense', '+7 Might', '+9 Vitality'],
        "Amount": ['', '7', '']
    }
)

But this doesn't seem to work.

I hope someone can help.

Best regards

Shamatix
  • 77
  • 1
  • 6

1 Answers1

9

Edit: Polars >= 0.14.4

As of Polars 0.14.4, the replace and replace_all expressions allow an Expression for the value parameter. Thus, we can solve this more simply as:

df.with_column(
    pl.col('Stat0').str.replace(r'\$i', pl.col('Amount'))
)
shape: (3, 5)
┌────────┬──────────────┬──────────┬─────────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Stat0       ┆ Amount │
│ ---    ┆ ---          ┆ ---      ┆ ---         ┆ ---    │
│ i64    ┆ i64          ┆ i64      ┆ str         ┆ str    │
╞════════╪══════════════╪══════════╪═════════════╪════════╡
│ 15148  ┆ 19200        ┆ 254      ┆ +5 Defense  ┆        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ +7 Might    ┆ 7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆ +9 Vitality ┆        │
└────────┴──────────────┴──────────┴─────────────┴────────┘

Polars < 0.14.4

The problem is that the replace method does not take an Expression, only a constant. Thus, we cannot use a column as replacement values, only a constant.

We can get around this in two ways.

Slow: using apply

This method uses python code to perform the replacement. Since we are executing python bytecode using apply, it will be slow. If your DataFrame is small, then this won't be too painfully slow.

(
    df
    .with_column(
        pl.struct(['Stat0', 'Amount'])
        .apply(lambda cols: cols['Stat0'].replace('$i', cols['Amount']))
        .alias('Stat0')
    )
)
shape: (3, 5)
┌────────┬──────────────┬──────────┬─────────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Stat0       ┆ Amount │
│ ---    ┆ ---          ┆ ---      ┆ ---         ┆ ---    │
│ i64    ┆ i64          ┆ i64      ┆ str         ┆ str    │
╞════════╪══════════════╪══════════╪═════════════╪════════╡
│ 15148  ┆ 19200        ┆ 254      ┆ +5 Defense  ┆        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ +7 Might    ┆ 7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆ +9 Vitality ┆        │
└────────┴──────────────┴──────────┴─────────────┴────────┘

Fast: using split_exact and when/then/otherwise

This method uses all Polars Expressions. As such, it will be much faster, especially for large DataFrames.

(
    df
    .with_column(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
    .unnest('Stat0')
    .with_column(
        pl.when(pl.col('field_1').is_null())
        .then(pl.col('field_0'))
        .otherwise(pl.concat_str(['field_0', 'Amount', 'field_1']))
        .alias('Stat0')
    )
    .drop(['field_0', 'field_1'])
)
shape: (3, 5)
┌────────┬──────────────┬──────────┬────────┬─────────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Amount ┆ Stat0       │
│ ---    ┆ ---          ┆ ---      ┆ ---    ┆ ---         │
│ i64    ┆ i64          ┆ i64      ┆ str    ┆ str         │
╞════════╪══════════════╪══════════╪════════╪═════════════╡
│ 15148  ┆ 19200        ┆ 254      ┆        ┆ +5 Defense  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ 7      ┆ +7 Might    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆        ┆ +9 Vitality │
└────────┴──────────────┴──────────┴────────┴─────────────┘

How it works: we first split the Stat0 column on $i using split_exact. This will produce a struct.

(
    df
    .with_column(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
)
shape: (3, 5)
┌────────┬──────────────┬──────────┬──────────────────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Stat0                ┆ Amount │
│ ---    ┆ ---          ┆ ---      ┆ ---                  ┆ ---    │
│ i64    ┆ i64          ┆ i64      ┆ struct[2]            ┆ str    │
╞════════╪══════════════╪══════════╪══════════════════════╪════════╡
│ 15148  ┆ 19200        ┆ 254      ┆ {"+5 Defense",null}  ┆        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ {"+"," Might"}       ┆ 7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆ {"+9 Vitality",null} ┆        │
└────────┴──────────────┴──────────┴──────────────────────┴────────┘

Notice that when Stat0 does not contain $i, the second member of the struct is null. We'll use this fact to our advantage.

In the next step, we break the struct into separate columns, using unnest.

(
    df
    .with_column(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
    .unnest('Stat0')
)
shape: (3, 6)
┌────────┬──────────────┬──────────┬─────────────┬─────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ field_0     ┆ field_1 ┆ Amount │
│ ---    ┆ ---          ┆ ---      ┆ ---         ┆ ---     ┆ ---    │
│ i64    ┆ i64          ┆ i64      ┆ str         ┆ str     ┆ str    │
╞════════╪══════════════╪══════════╪═════════════╪═════════╪════════╡
│ 15148  ┆ 19200        ┆ 254      ┆ +5 Defense  ┆ null    ┆        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ +           ┆  Might  ┆ 7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆ +9 Vitality ┆ null    ┆        │
└────────┴──────────────┴──────────┴─────────────┴─────────┴────────┘

This creates two new columns: field_0 and field_1.

From here, we use when/then/otherwise and concat_str to construct the final result

Basically:

  • when $i does not appear in the Stat0 column, then the string is not split, and field_1 is null, so we can use the value in field_0 as is.
  • when $i does appear in Stat0, then the string is split into two parts: field_0 and field_1. We simply concatenate the parts back together, putting Amount in the middle.
(
    df
    .with_column(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
    .unnest('Stat0')
    .with_column(
        pl.when(pl.col('field_1').is_null())
        .then(pl.col('field_0'))
        .otherwise(pl.concat_str(['field_0', 'Amount', 'field_1']))
        .alias('Stat0')
    )
)
shape: (3, 7)
┌────────┬──────────────┬──────────┬─────────────┬─────────┬────────┬─────────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ field_0     ┆ field_1 ┆ Amount ┆ Stat0       │
│ ---    ┆ ---          ┆ ---      ┆ ---         ┆ ---     ┆ ---    ┆ ---         │
│ i64    ┆ i64          ┆ i64      ┆ str         ┆ str     ┆ str    ┆ str         │
╞════════╪══════════════╪══════════╪═════════════╪═════════╪════════╪═════════════╡
│ 15148  ┆ 19200        ┆ 254      ┆ +5 Defense  ┆ null    ┆        ┆ +5 Defense  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 15148  ┆ 200          ┆ -1       ┆ +           ┆  Might  ┆ 7      ┆ +7 Might    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 24957  ┆ 24           ┆ -44      ┆ +9 Vitality ┆ null    ┆        ┆ +9 Vitality │
└────────┴──────────────┴──────────┴─────────────┴─────────┴────────┴─────────────┘