Edit: Polars >= 0.14.4
As of Polars 0.14.4, the replace
and replace_all
expressions allow an Expression for the value
parameter. Thus, we can solve this more simply as:
df.with_column(
pl.col('Stat0').str.replace(r'\$i', pl.col('Amount'))
)
shape: (3, 5)
┌────────┬──────────────┬──────────┬─────────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Stat0 ┆ Amount │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str ┆ str │
╞════════╪══════════════╪══════════╪═════════════╪════════╡
│ 15148 ┆ 19200 ┆ 254 ┆ +5 Defense ┆ │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148 ┆ 200 ┆ -1 ┆ +7 Might ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957 ┆ 24 ┆ -44 ┆ +9 Vitality ┆ │
└────────┴──────────────┴──────────┴─────────────┴────────┘
Polars < 0.14.4
The problem is that the replace
method does not take an Expression, only a constant. Thus, we cannot use a column as replacement values, only a constant.
We can get around this in two ways.
Slow: using apply
This method uses python code to perform the replacement. Since we are executing python bytecode using apply
, it will be slow. If your DataFrame is small, then this won't be too painfully slow.
(
df
.with_column(
pl.struct(['Stat0', 'Amount'])
.apply(lambda cols: cols['Stat0'].replace('$i', cols['Amount']))
.alias('Stat0')
)
)
shape: (3, 5)
┌────────┬──────────────┬──────────┬─────────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Stat0 ┆ Amount │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str ┆ str │
╞════════╪══════════════╪══════════╪═════════════╪════════╡
│ 15148 ┆ 19200 ┆ 254 ┆ +5 Defense ┆ │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148 ┆ 200 ┆ -1 ┆ +7 Might ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957 ┆ 24 ┆ -44 ┆ +9 Vitality ┆ │
└────────┴──────────────┴──────────┴─────────────┴────────┘
Fast: using split_exact
and when/then/otherwise
This method uses all Polars Expressions. As such, it will be much faster, especially for large DataFrames.
(
df
.with_column(
pl.col('Stat0').str.split_exact('$i', 1)
)
.unnest('Stat0')
.with_column(
pl.when(pl.col('field_1').is_null())
.then(pl.col('field_0'))
.otherwise(pl.concat_str(['field_0', 'Amount', 'field_1']))
.alias('Stat0')
)
.drop(['field_0', 'field_1'])
)
shape: (3, 5)
┌────────┬──────────────┬──────────┬────────┬─────────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Amount ┆ Stat0 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str ┆ str │
╞════════╪══════════════╪══════════╪════════╪═════════════╡
│ 15148 ┆ 19200 ┆ 254 ┆ ┆ +5 Defense │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 15148 ┆ 200 ┆ -1 ┆ 7 ┆ +7 Might │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 24957 ┆ 24 ┆ -44 ┆ ┆ +9 Vitality │
└────────┴──────────────┴──────────┴────────┴─────────────┘
How it works: we first split the Stat0
column on $i
using split_exact
. This will produce a struct.
(
df
.with_column(
pl.col('Stat0').str.split_exact('$i', 1)
)
)
shape: (3, 5)
┌────────┬──────────────┬──────────┬──────────────────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Stat0 ┆ Amount │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ struct[2] ┆ str │
╞════════╪══════════════╪══════════╪══════════════════════╪════════╡
│ 15148 ┆ 19200 ┆ 254 ┆ {"+5 Defense",null} ┆ │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148 ┆ 200 ┆ -1 ┆ {"+"," Might"} ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957 ┆ 24 ┆ -44 ┆ {"+9 Vitality",null} ┆ │
└────────┴──────────────┴──────────┴──────────────────────┴────────┘
Notice that when Stat0
does not contain $i
, the second member of the struct is null
. We'll use this fact to our advantage.
In the next step, we break the struct into separate columns, using unnest
.
(
df
.with_column(
pl.col('Stat0').str.split_exact('$i', 1)
)
.unnest('Stat0')
)
shape: (3, 6)
┌────────┬──────────────┬──────────┬─────────────┬─────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ field_0 ┆ field_1 ┆ Amount │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str ┆ str ┆ str │
╞════════╪══════════════╪══════════╪═════════════╪═════════╪════════╡
│ 15148 ┆ 19200 ┆ 254 ┆ +5 Defense ┆ null ┆ │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148 ┆ 200 ┆ -1 ┆ + ┆ Might ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957 ┆ 24 ┆ -44 ┆ +9 Vitality ┆ null ┆ │
└────────┴──────────────┴──────────┴─────────────┴─────────┴────────┘
This creates two new columns: field_0
and field_1
.
From here, we use when/then/otherwise
and concat_str
to construct the final result
Basically:
- when
$i
does not appear in the Stat0
column, then the string is not split, and field_1
is null
, so we can use the value in field_0
as is.
- when
$i
does appear in Stat0
, then the string is split into two parts: field_0
and field_1
. We simply concatenate the parts back together, putting Amount
in the middle.
(
df
.with_column(
pl.col('Stat0').str.split_exact('$i', 1)
)
.unnest('Stat0')
.with_column(
pl.when(pl.col('field_1').is_null())
.then(pl.col('field_0'))
.otherwise(pl.concat_str(['field_0', 'Amount', 'field_1']))
.alias('Stat0')
)
)
shape: (3, 7)
┌────────┬──────────────┬──────────┬─────────────┬─────────┬────────┬─────────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ field_0 ┆ field_1 ┆ Amount ┆ Stat0 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str ┆ str ┆ str ┆ str │
╞════════╪══════════════╪══════════╪═════════════╪═════════╪════════╪═════════════╡
│ 15148 ┆ 19200 ┆ 254 ┆ +5 Defense ┆ null ┆ ┆ +5 Defense │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 15148 ┆ 200 ┆ -1 ┆ + ┆ Might ┆ 7 ┆ +7 Might │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 24957 ┆ 24 ┆ -44 ┆ +9 Vitality ┆ null ┆ ┆ +9 Vitality │
└────────┴──────────────┴──────────┴─────────────┴─────────┴────────┴─────────────┘