4

For a dataframe, the goal is to have the mean of a column - a groupby another column - b given the first value of a in the group is not null, if it is, just return null.

The sample dataframe

df = pl.DataFrame({"a": [None, 1, 2, 3, 4], "b": [1, 1, 2, 2, 2]})

I tried something like

df.groupby("b").agg(
    pl.when(pl.col("a").first().is_null()).then(None).otherwise(pl.mean("a"))
)

The results are as expected but get a warning saying when may not be guaranteed to do its job in groupby context.

>>> df.groupby("b").agg(pl.when(pl.col("a").first().is_null()).then(None).otherwise(pl.mean("a")))
The predicate 'col("a").first().is_null()' in 'when->then->otherwise' is not a valid aggregation and might produce a different number of rows than the groupby operation would. This behavior is experimental and may be subject to change
shape: (2, 2)
┌─────┬─────────┐
│ b   ┆ literal │
│ --- ┆ ---     │
│ i64 ┆ f64     │
╞═════╪═════════╡
│ 1   ┆ null    │
│ 2   ┆ 3.0     │
└─────┴─────────┘

May I know why and what could be a better alternative way to do if-else in groupby?

lebesgue
  • 837
  • 4
  • 13

1 Answers1

4

You need to use:

  • pl.col("a").is_null().first()

instead of:

  • pl.col("a").first().is_null()

If we look at both approaches:

df.groupby("b", maintain_order=True).agg(
   pl.col("a"),
   pl.col("a").is_not_null().alias("yes"),
   pl.col("a").first().is_not_null().alias("no"),
)
shape: (2, 4)
┌─────┬───────────┬────────────────────┬───────┐
│ b   ┆ a         ┆ yes                ┆ no    │
│ --- ┆ ---       ┆ ---                ┆ ---   │
│ i64 ┆ list[i64] ┆ list[bool]         ┆ bool  │
╞═════╪═══════════╪════════════════════╪═══════╡
│ 1   ┆ [null, 1] ┆ [false, true]      ┆ false │
│ 2   ┆ [2, 3, 4] ┆ [true, true, true] ┆ true  │
└─────┴───────────┴────────────────────┴───────┘

In the no case, only null and 2 are passed to .is_not_null() - the rest of the inputs have been "silently discarded".

polars knows a has lengths of 2 and 3 and expects "boolean masks" of the same length.

We can take the .first() value of yes which has the same end result:

df.groupby("b", maintain_order=True).agg(
   pl.col("a"),
   pl.col("a").is_not_null().first().alias("yes"),
   pl.col("a").first().is_not_null().alias("no"),
)
shape: (2, 4)
┌─────┬───────────┬───────┬───────┐
│ b   ┆ a         ┆ yes   ┆ no    │
│ --- ┆ ---       ┆ ---   ┆ ---   │
│ i64 ┆ list[i64] ┆ bool  ┆ bool  │
╞═════╪═══════════╪═══════╪═══════╡
│ 1   ┆ [null, 1] ┆ false ┆ false │
│ 2   ┆ [2, 3, 4] ┆ true  ┆ true  │
└─────┴───────────┴───────┴───────┘

But now all of the inputs have been passed to .is_not_null() and the length check passes.


.otherwise

Not related to the issue but: as the default value for .otherwise is None:

df.groupby("b").agg(
   pl.when(pl.col("a").is_null().first())
     .then(None) 
     .otherwise(pl.mean("a"))
)
shape: (2, 2)
┌─────┬─────────┐
│ b   ┆ literal │
│ --- ┆ ---     │
│ i64 ┆ f64     │
╞═════╪═════════╡
│ 1   ┆ null    │
│ 2   ┆ 3.0     │
└─────┴─────────┘

You can invert the logic and allow the default to be used:

df.groupby("b").agg(
   pl.when(pl.col("a").is_not_null().first())
     .then(pl.mean("a"))
)
shape: (2, 2)
┌─────┬──────┐
│ b   ┆ a    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ null │
│ 2   ┆ 3.0  │
└─────┴──────┘
jqurious
  • 9,953
  • 1
  • 4
  • 14
  • Could you add some explanation as to _why_ testing null first, then getting the first value is different from testing nullness of the first value? I am hitting the same warning but I really do not see how they should be different. – Dan Apr 12 '23 at 20:25
  • @Dan The issue is `pl.col("a").first().is_not_null()` vs. `pl.col("a").is_not_null()`. If a has a length of `3` - in the first case, `1` value is tested, in the second case, `3` values are tested. This "length mismatch" raises a warning as polars expects all of the inputs (3 values) to be tested. As you only want the first result from case #2, the overall behaviour would not be different. Polars is just detecting that some of your inputs have been "silently discarded" which should not happen under "normal circumstances". – jqurious Apr 12 '23 at 22:06
  • @Dan I've expanded the answer - feel free to suggest edits if you think it can be improved upon. – jqurious Apr 14 '23 at 12:21