0

In pandas, one can use logical indexing to assign items:

s = pd.Series(['a', 'b', 'c', 'd', 'e'])
idx = [True, False, True, False, True]
s[idx] = ['x', 'y', 'z']

In polars, we can do this with set_at_idx:

s = pl.Series(['a', 'b', 'c', 'd', 'e'])
idx = pl.Series([True, False, True, False, True])
s = s.set_at_idx(idx.arg_true(), ['x', 'y', 'z']) # arg_true() gives us integer indices

However, I'm struggling to figure out how to do this in an expression context, as I cannot do df['col'] = df['col'].set_at_idx.... Instead, polars suggests I use with_column:

import polars as pl
from polars import col, when

df = pl.DataFrame({
    "my_col": ['a', 'b', 'c', 'd', 'e'],
    "idx": [True, False, True, False, True]
})

new_values = ['x', 'y', 'z']
df = df.with_column(
    when(col("idx")).then(new_values) # how to do this?
    .otherwise(col("my_col"))
)

My method of getting around this is somewhat long-winded and there must be an easier way:

s = df["my_col"].clone()
s = s.set_at_idx(df["idx"].arg_true(), new_values)
df = df.with_column(s.alias("my_col"))

Syntactically it's not horrible, but is there an easier way to simply update a series with a list of values (or other series)?

NedDasty
  • 192
  • 1
  • 8

1 Answers1

2

I'm not aware of an elegant way to convert your code using Series directly to Expressions. But we won't let that stop us.

One underappreciated aspect of the Polars architecture is the ability to compose our own Expressions using existing Polars API Expressions. (Perhaps there should be a section in the User Guide for this.)

So, let's create an Expression to do what we need.

The code below may seem overwhelming at first. We'll look at examples and I'll explain how it works in detail below.

Expression: set_by_mask

Here's a custom Expression that will set values based on a mask. For lack of a better name I've called it set_by_mask. The Expression is a bit rough (e.g., it does zero error-checking), but it should act as a good starting point.

Notice at the end that we will assign this function as a method of the Expr class, so that it can be used just like any other Expression (e.g., it can participate in any valid chain of Expressions, be used within a groupby, etc..)

Much of the code below deals with "convenience" methods (e.g., allowing the mask parameters to be a list/tuple or an Expression or a Series). Later, we'll go through how the heart of the algorithm works.

Here's the code:

from typing import Any, Sequence
import polars.internals as pli


def set_by_mask(
    self: pli.Expr,
    mask: str | Sequence[bool] | pli.Series | pli.Expr,
    values: Sequence[Any] | pli.Series | pli.Expr,
) -> pli.Expr:
    """
    Set values at mask locations.

    Parameters
    ----------
    mask
        Indices with True values are replaced with values.

        Sequence[bool]: list or tuple of boolean values
        str: column name of boolean Expression
        Series | Expr: Series or Expression that evaluates to boolean

    values
        Values to replace where mask is True

    Notes:
        The number of elements in values must match the number of
        True values in mask.

        The mask Expression/list/tuple must match the length of the
        Expression for which values are being set.

    """

    if isinstance(mask, str):
        mask = pl.col(mask)

    if isinstance(mask, Sequence):
        mask = pli.Series("", mask)

    if isinstance(values, Sequence):
        values = pli.Series("", values)

    if isinstance(mask, pli.Series):
        mask = pli.lit(mask)

    if isinstance(values, pli.Series):
        values = pli.lit(values)

    result = (
        self
        .sort_by(mask)
        .slice(0, mask.is_not().sum())
        .append(values)
        .sort_by(mask.arg_sort())
    )

    return self._from_pyexpr(result._pyexpr)


pli.Expr.set_by_mask = set_by_mask

Examples

First, let's run through some examples of how this works.

In the example below, we'll pass a string as our mask parameter -- indicating the column name of df to be used as a mask. And we'll pass a simple Python list of string values as our values parameter.

Remember to run the code above first before running the examples below. We need the set_by_mask function to be a method of the Expr class. (Don't worry, it's not permanent - when the Python interpreter exits, the Expr class will be restored to its original state.)

import polars as pl

df = pl.DataFrame({
    "my_col": ["a", "b", "c", "d", "e"],
    "idx": [True, False, True, False, True],
})

new_values = ("x", "y", "z")
(
    df
    .with_columns([
        pl.col('my_col').set_by_mask("idx", new_values).alias('result')
    ])
)

shape: (5, 3)
┌────────┬───────┬────────┐
│ my_col ┆ idx   ┆ result │
│ ---    ┆ ---   ┆ ---    │
│ str    ┆ bool  ┆ str    │
╞════════╪═══════╪════════╡
│ a      ┆ true  ┆ x      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b      ┆ false ┆ b      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ c      ┆ true  ┆ y      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ d      ┆ false ┆ d      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ e      ┆ true  ┆ z      │
└────────┴───────┴────────┘

We see that the values for a, c, and e have been replaced, consistent with where the mask evaluated to True.

As another example, let's pass the mask and values parameter as external Series (ignoring the idx column).

new_values = pl.Series("", ['1', '2', '4', '5'])
mask = pl.Series("", [True, True, False, True, True])
(
    df
    .with_columns([
        pl.col('my_col').set_by_mask(mask, new_values).alias('result')
    ])
)
shape: (5, 3)
┌────────┬───────┬────────┐
│ my_col ┆ idx   ┆ result │
│ ---    ┆ ---   ┆ ---    │
│ str    ┆ bool  ┆ str    │
╞════════╪═══════╪════════╡
│ a      ┆ true  ┆ 1      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b      ┆ false ┆ 2      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ c      ┆ true  ┆ c      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ d      ┆ false ┆ 4      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ e      ┆ true  ┆ 5      │
└────────┴───────┴────────┘

How it works

The heart of the algorithm is this snippet.

    result = (
        self
        .sort_by(mask)
        .slice(0, mask.is_not().sum())
        .append(values)
        .sort_by(mask.arg_sort())
    )

To see how it works, we'll use the first example and watch how the algorithm builds the answer.

First, we'll need to change the with_columns in your query to a select, because the intermediate steps in the algorithm won't produce a column whose length matches the other columns, which will lead to an error.

Here's the code that we'll run to observe the steps in the algorithm

import polars as pl
new_values = ("x", "y", "z")
(
    pl.DataFrame({
        "my_col": ["a", "b", "c", "d", "e"],
        "idx": [True, False, True, False, True],
    })
    .select([
        pl.col('my_col').set_by_mask("idx", new_values).alias('result')
    ])
)
shape: (5, 1)
┌────────┐
│ result │
│ ---    │
│ str    │
╞════════╡
│ x      │
├╌╌╌╌╌╌╌╌┤
│ b      │
├╌╌╌╌╌╌╌╌┤
│ y      │
├╌╌╌╌╌╌╌╌┤
│ d      │
├╌╌╌╌╌╌╌╌┤
│ z      │
└────────┘

With that in place, let's look at how the algorithm evolves.

The Algorithm in steps

The first step of the algorithm is to sort the original column so that non-changing values (those corresponding to a mask value of False) are sorted to the top. We'll accomplish this using the sort_by Expression, and pass mask as our sorting criterion.

I'll change the heart of the algorithm to only these steps.

    result = (
        self
        .sort_by(mask)
    )

Here's the result.

shape: (5, 1)
┌────────┐
│ result │
│ ---    │
│ str    │
╞════════╡
│ b      │
├╌╌╌╌╌╌╌╌┤
│ d      │
├╌╌╌╌╌╌╌╌┤
│ a      │
├╌╌╌╌╌╌╌╌┤
│ c      │
├╌╌╌╌╌╌╌╌┤
│ e      │
└────────┘

In our example, values b and d are not changing and are sorted to the top; values a, c, and e are being replaced and are sorted to the bottom.

In the next step, we'll use the slice Expression to eliminate those values that will be replaced.

    result = (
        self
        .sort_by(mask)
        .slice(0, mask.is_not().sum())
    )
shape: (2, 1)
┌────────┐
│ result │
│ ---    │
│ str    │
╞════════╡
│ b      │
├╌╌╌╌╌╌╌╌┤
│ d      │
└────────┘

In the next step, we'll use the append Expression to place the new values at the bottom.

    result = (
        self
        .sort_by(mask)
        .slice(0, mask.is_not().sum())
        .append(values)
    )
shape: (5, 1)
┌────────┐
│ result │
│ ---    │
│ str    │
╞════════╡
│ b      │
├╌╌╌╌╌╌╌╌┤
│ d      │
├╌╌╌╌╌╌╌╌┤
│ x      │
├╌╌╌╌╌╌╌╌┤
│ y      │
├╌╌╌╌╌╌╌╌┤
│ z      │
└────────┘

Now for the tricky step: how to get the values sorted in the proper order.

We're going to use arg_sort to accomplish this. One property of an arg_sort is that it can restore a sorted column back to its original un-sorted state.

If we look at the values above, the non-replaced values are at the top (corresponding to a mask value of False). And the replaced values are at the bottom (corresponding to a mask value of True). This corresponds to a mask of [False, False, True, True, True].

That, in turn, corresponds to the mask expression when it is sorted. (False sorts before True). Hence, sorting the column by the arg_sort of the mask will restore the column to correspond to the original un-sorted maskcolumn.

    result = (
        self
        .sort_by(mask)
        .slice(0, mask.is_not().sum())
        .append(values)
        .sort_by(mask.arg_sort())
    )
shape: (5, 1)
┌────────┐
│ result │
│ ---    │
│ str    │
╞════════╡
│ x      │
├╌╌╌╌╌╌╌╌┤
│ b      │
├╌╌╌╌╌╌╌╌┤
│ y      │
├╌╌╌╌╌╌╌╌┤
│ d      │
├╌╌╌╌╌╌╌╌┤
│ z      │
└────────┘

It subtle, but it works.

I appreciate the above may be more than you wanted. But hopefully, it demonstrates how we can compose our own Expressions using existing Polars Expressions.

  • Thanks so much for this. Out of curiosity, would this be more or less performant than composing a sequence of using `idx.arg_true()` followed by a composed sequence of when/then expressions in conjunction with `.with_row_number()`? – NedDasty Sep 18 '22 at 18:28
  • Two thoughts. `when/then/otherwise` requires that the columns/expressions in `then` and `otherwise` have the same length. In this case, our `values` array is always shorter. (This is essentially the entire problem here - because if `values` was the same length as your columns, you could simply use a `when/then/otherwise`.) Second, the Expression above can be used in Lazy mode. `arg_true` is not available in Lazy mode. Thus, you might miss out on the optimizations that come with Lazy mode. –  Sep 18 '22 at 19:27