2

I have a polars LazyFrame which has 3 columns of type nullable list[f64], something like this.

import polars as pl

lf = pl.DataFrame({
    "1": [
        [0.0, 1.1, 2.2],
        [0.0, 1.1, 2.2],
        [0.0, 1.1, 2.2],
        None,
    ],
    "2": [
        [0.3, 1.3, 2.3],
        [0.4, 1.4, 2.4],
        [0.5, 1.5, 2.5],
        None,
    ],
    "3": [
        [0.7, 1.7, 2.7],
        None,
        [0.9, 1.9, 2.9],
        None,
    ],
}).lazy()
┌─────────────────┬─────────────────┬─────────────────┐
│ 1               ┆ 2               ┆ 3               │
│ ---             ┆ ---             ┆ ---             │
│ list[f64]       ┆ list[f64]       ┆ list[f64]       │
╞═════════════════╪═════════════════╪═════════════════╡
│ [0.0, 1.1, 2.2] ┆ [0.3, 1.3, 2.3] ┆ [0.7, 1.7, 2.7] │
│ [0.0, 1.1, 2.2] ┆ [0.4, 1.4, 2.4] ┆ null            │
│ [0.0, 1.1, 2.2] ┆ [0.5, 1.5, 2.5] ┆ [0.9, 1.9, 2.9] │
│ null            ┆ null            ┆ null            │
└─────────────────┴─────────────────┴─────────────────┘

I need to add a column with the average of the three column's lists, furthermore

  • when in a row there are only null, then avg will be a zeros-filled list of fixed length 3
  • when one item is null the average will be computed on the non-null lists

For "average of the lists" I mean the element-wise sum divided by the number of lists involved in the sum.

So in the first row I want:

[
    0.0 + 0.3 + 0.7,
    1.1 + 1.3 + 1.7,
    2.2 + 2.3 + 2.7
] / 3
=
[
    1.0, 4.1, 7.2
] / 3
=
[0.33, 1.36, 2.40]

In the second row: [0.0 + 0.4, 1.1 + 1.4, 2.2 + 2.4] / 2 = [0.4, 2.8, 4.6] / 2 = [0.2, 1.4, 2.3].

In the last row: [0.0, 0.0, 0.0].

I found a way to sum the columns

lf.select(
    pl.sum_horizontal(
        pl.col("*").list.explode()
    ).reshape((1, -1)).alias("sum"),
).collect()

But this only works when all the items in row are non-null.

Nemoos
  • 35
  • 4

2 Answers2

2
lf.with_columns(
    avg=pl.when(pl.all_horizontal(pl.all().is_null()))
    .then([0, 0, 0])
    .otherwise(
        pl.reduce(
            lambda acc, x: pl.when(x.is_null())
            .then(acc)
            .when(acc.is_null())
            .then(x)
            .otherwise(acc.list.concat(x)),
            pl.all(),
        )
    )
).with_columns(
    avg=pl.concat_list(
        pl.col('avg')
        .list.take(pl.int_ranges(i, pl.col('avg').list.lengths(), 3))
        .list.sum()
        / (pl.col('avg').list.lengths() / 3)
        for i in range(3)
    )
).collect()
shape: (4, 4)
┌─────────────────┬─────────────────┬─────────────────┬───────────────────────────┐
│ 1               ┆ 2               ┆ 3               ┆ avg                       │
│ ---             ┆ ---             ┆ ---             ┆ ---                       │
│ list[f64]       ┆ list[f64]       ┆ list[f64]       ┆ list[f64]                 │
╞═════════════════╪═════════════════╪═════════════════╪═══════════════════════════╡
│ [0.0, 1.1, 2.2] ┆ [0.3, 1.3, 2.3] ┆ [0.7, 1.7, 2.7] ┆ [0.333333, 1.366667, 2.4] │
│ [0.0, 1.1, 2.2] ┆ [0.4, 1.4, 2.4] ┆ null            ┆ [0.2, 1.25, 2.3]          │
│ [0.0, 1.1, 2.2] ┆ [0.5, 1.5, 2.5] ┆ [0.9, 1.9, 2.9] ┆ [0.466667, 1.5, 2.533333] │
│ null            ┆ null            ┆ null            ┆ [0.0, 0.0, 0.0]           │
└─────────────────┴─────────────────┴─────────────────┴───────────────────────────┘

The main step here is a pl.reduce (horizontal, row-wise fold reduction) where if the intermediate element x is null, we do nothing (return acc in that intermediate step), or vice versa, or otherwise concatenate the lists all together. So all the numbers are there to do the calculation, then it comes down to appropriately take-ing each set of elements to do math on.

Some of these 3s should be generalizable as len(lf.columns) before the entire operation.

Wayoshi
  • 1,688
  • 1
  • 7
1

Perhaps you could expand on your initial attempt by replacing null with [0, 0, 0] and manually creating a list of [width, width, width] to divide by.

width = pl.sum_horizontal(pl.all().is_not_null()) 

df.with_columns(avg = 
   (pl.sum_horizontal(
      pl.when(pl.all().is_null()).then([0, 0, 0]).otherwise(pl.all()).list.explode()
   ) / pl.concat_list(width, width, width).list.explode())
   .fill_nan(0)
   .reshape((-1, 3))
)
shape: (4, 4)
┌─────────────────┬─────────────────┬─────────────────┬───────────────────────────┐
│ 1               ┆ 2               ┆ 3               ┆ avg                       │
│ ---             ┆ ---             ┆ ---             ┆ ---                       │
│ list[f64]       ┆ list[f64]       ┆ list[f64]       ┆ list[f64]                 │
╞═════════════════╪═════════════════╪═════════════════╪═══════════════════════════╡
│ [0.0, 1.1, 2.2] ┆ [0.3, 1.3, 2.3] ┆ [0.7, 1.7, 2.7] ┆ [0.333333, 1.366667, 2.4] │
│ [0.0, 1.1, 2.2] ┆ [0.4, 1.4, 2.4] ┆ null            ┆ [0.2, 1.25, 2.3]          │
│ [0.0, 1.1, 2.2] ┆ [0.5, 1.5, 2.5] ┆ [0.9, 1.9, 2.9] ┆ [0.466667, 1.5, 2.533333] │
│ null            ┆ null            ┆ null            ┆ [0.0, 0.0, 0.0]           │
└─────────────────┴─────────────────┴─────────────────┴───────────────────────────┘

Or as a possible function:

def list_avg(size):
    # replace `null` with lists of `0`
    cols = pl.when(pl.all().is_null()).then([0] * size).otherwise(pl.all())
    cols = cols.list.explode()
    
    # number of non-null columns
    num_cols = pl.sum_horizontal(pl.all().is_not_null()) 
    num_cols = pl.concat_list([num_cols] * size).list.explode()
    
    return (pl.sum_horizontal(cols) / num_cols).fill_nan(0).reshape((-1, size))
    
    
df.with_columns(avg = list_avg(size=3))
jqurious
  • 9,953
  • 1
  • 4
  • 14
  • 1
    To create the list of [width, width, width] I used *[width for _ in range(3)], this way I can replace that 3 with the length of the list – Nemoos Aug 17 '23 at 08:53
  • 1
    Yeah, good point. As it is somewhat complex - maybe creating a function could be useful. I've updated with an example. – jqurious Aug 17 '23 at 14:24