2

I have a data set with three columns. Column A is to be checked for strings. If the string matches foo or spam, the values in the same row for the other two columns L and G should be changed to XX. For this I have tried the following.

df = pl.DataFrame(
    {
        "A": ["foo", "ham", "spam", "egg",],
        "L": ["A54", "A12", "B84", "C12"],
        "G": ["X34", "C84", "G96", "L6",],
    }
)
print(df)

shape: (4, 3)
┌──────┬─────┬─────┐
│ A    ┆ L   ┆ G   │
│ ---  ┆ --- ┆ --- │
│ str  ┆ str ┆ str │
╞══════╪═════╪═════╡
│ foo  ┆ A54 ┆ X34 │
│ ham  ┆ A12 ┆ C84 │
│ spam ┆ B84 ┆ G96 │
│ egg  ┆ C12 ┆ L6  │
└──────┴─────┴─────┘

expected outcome

shape: (4, 3)
┌──────┬─────┬─────┐
│ A    ┆ L   ┆ G   │
│ ---  ┆ --- ┆ --- │
│ str  ┆ str ┆ str │
╞══════╪═════╪═════╡
│ foo  ┆ XX  ┆ XX  │
│ ham  ┆ A12 ┆ C84 │
│ spam ┆ XX  ┆ XX  │
│ egg  ┆ C12 ┆ L6  │
└──────┴─────┴─────┘

I tried this

df = df.with_column(
    pl.when((pl.col("A") == "foo") | (pl.col("A") == "spam"))
    .then((pl.col("L")= "XX") & (pl.col( "G")= "XX"))
    .otherwise((pl.col("L"))&(pl.col( "G")))
)

However, this does not work. Can someone help me with this?

Horseman
  • 297
  • 1
  • 14

1 Answers1

4

For setting multiple columns to the same value you could use:

df.with_columns(
   pl.when(pl.col("A").is_in(["foo", "spam"]))
     .then(pl.lit("XX"))
     .otherwise(pl.col("L", "G"))
     .keep_name()
)
shape: (4, 3)
┌──────┬─────┬─────┐
│ A    ┆ L   ┆ G   │
│ ---  ┆ --- ┆ --- │
│ str  ┆ str ┆ str │
╞══════╪═════╪═════╡
│ foo  ┆ XX  ┆ XX  │
│ ham  ┆ A12 ┆ C84 │
│ spam ┆ XX  ┆ XX  │
│ egg  ┆ C12 ┆ L6  │
└──────┴─────┴─────┘

.is_in() can be used instead of multiple == x | == y chains.

To update multiple columns at once with different values you could use .map() and a dictionary:

df.with_columns(
   pl.when(pl.col("A").is_in(["foo", "spam"]))
     .then(pl.col("L", "G").map(
        lambda col: {
           "L": "XX",
           "G": "YY",
        }.get(col.name)))
     .otherwise(pl.col("L", "G"))
)
shape: (4, 3)
┌──────┬─────┬─────┐
│ A    ┆ L   ┆ G   │
│ ---  ┆ --- ┆ --- │
│ str  ┆ str ┆ str │
╞══════╪═════╪═════╡
│ foo  ┆ XX  ┆ YY  │
│ ham  ┆ A12 ┆ C84 │
│ spam ┆ XX  ┆ YY  │
│ egg  ┆ C12 ┆ L6  │
└──────┴─────┴─────┘

When using .map the whole column is passed (as a pl.Series) meaning you can also manipulate it if desired:

df.with_columns(
   pl.when(pl.col("A").is_in(["foo", "spam"]))
     .then(pl.col("L", "G").map(
        lambda col: {
           "L": col.str.to_lowercase(),
           "G": "prefix_" + col
        }.get(col.name)))
     .otherwise(pl.col("L", "G"))
)
shape: (4, 3)
┌──────┬─────┬────────────┐
│ A    ┆ L   ┆ G          │
│ ---  ┆ --- ┆ ---        │
│ str  ┆ str ┆ str        │
╞══════╪═════╪════════════╡
│ foo  ┆ a54 ┆ prefix_X34 │
│ ham  ┆ A12 ┆ C84        │
│ spam ┆ b84 ┆ prefix_G96 │
│ egg  ┆ C12 ┆ L6         │
└──────┴─────┴────────────┘
jqurious
  • 9,953
  • 1
  • 4
  • 14
  • 1
    How can you do such an operation if you need to access the column value inside the lambda ? For exemple : `lambda col: {"L": pl.col("..."), "G": pl.col("...")}.get(col.name)` – mlisthenewcool Aug 07 '23 at 12:19
  • @mlisthenewcool `col` is the column (pl.Series), so you can e.g. `{"L": col.str.to_uppercase(), ...}` – jqurious Aug 07 '23 at 12:24