
I'm trying to find an idiomatic Polars way of aggregating data per row. It's not strictly about the .sum function; it's about all aggregations where an axis argument makes sense. Take a look at these pandas examples:

df[df.sum(axis=1) > 5]

df.assign(median=df.median(axis=1))

df[df.rolling(3, axis=1).mean() > 0]

However, with Polars, problems appear very quickly:

df.filter(df.sum(axis=1)>5)

df.with_column(df.mean(axis=1).alias('mean')) - can't do median

df... - can't do rolling, rank, or anything more complex.

I saw the page where the Polars authors suggest doing everything by hand with folds, but there are cases where the logic doesn't fit into one input and one accumulator variable (e.g. a simple median).

Moreover, this approach seems not to work at all with expressions, i.e. pl.all().sum(axis=1) is not valid since, for some reason, the axis argument is absent there.

So the question is: how should I deal with these situations? I'd like to have the full Polars API at my fingertips, instead of whatever suboptimal workarounds I can come up with.

1 Answer


Row-wise computations:

You can create a list column and access the .arr namespace for row-wise computations.

@ritchie46's answer regarding rank(axis=1) is also useful reading.

.arr.eval() can be used for more complex computations.

df = pl.DataFrame([[1, 2, 3], [4, 5, 3], [1, 8, 9]])
(df.with_column(pl.concat_list(pl.all()).alias("row"))
   .with_columns([
      pl.col("row").arr.sum().alias("sum"),
      pl.col("row").arr.mean().alias("mean"),
      pl.col("row").arr.eval(pl.all().median(), parallel=True).alias("median"),
      pl.col("row").arr.eval(pl.all().rank(),   parallel=True).alias("rank"),
   ])
)
shape: (3, 8)
┌──────────┬──────────┬──────────┬───────────┬─────┬──────┬───────────┬─────────────────┐
│ column_0 | column_1 | column_2 | row       | sum | mean | median    | rank            │
│ ---      | ---      | ---      | ---       | --- | ---  | ---       | ---             │
│ i64      | i64      | i64      | list[i64] | i64 | f64  | list[f64] | list[f32]       │
╞══════════╪══════════╪══════════╪═══════════╪═════╪══════╪═══════════╪═════════════════╡
│ 1        | 4        | 1        | [1, 4, 1] | 6   | 2.0  | [1.0]     | [1.5, 3.0, 1.5] │
├──────────┼──────────┼──────────┼───────────┼─────┼──────┼───────────┼─────────────────┤
│ 2        | 5        | 8        | [2, 5, 8] | 15  | 5.0  | [5.0]     | [1.0, 2.0, 3.0] │
├──────────┼──────────┼──────────┼───────────┼─────┼──────┼───────────┼─────────────────┤
│ 3        | 3        | 9        | [3, 3, 9] | 15  | 5.0  | [3.0]     | [1.5, 1.5, 3.0] │
└──────────┴──────────┴──────────┴───────────┴─────┴──────┴───────────┴─────────────────┘

pl.sum()

Can be given a list of columns.

>>> df.select(pl.sum(pl.all()))
shape: (3, 1)
┌─────┐
│ sum │
│ --- │
│ i64 │
╞═════╡
│ 6   │
├─────┤
│ 15  │
├─────┤
│ 15  │
└─────┘

.rolling_mean()

Can be accessed inside .arr.eval()

pdf = df.to_pandas()
pdf[pdf.rolling(2, axis=1).mean() > 3]
   column_0  column_1  column_2
0       NaN       NaN       NaN
1       NaN       5.0       8.0
2       NaN       NaN       9.0
(df.with_column(pl.concat_list(pl.all()).alias("row"))
   .with_column(
      pl.col("row").arr.eval(
         pl.when(pl.all().rolling_mean(2) > 3)
           .then(pl.all()), 
         parallel=True)
      .alias("rolling[mean] > 3"))
)
shape: (3, 5)
┌──────────┬──────────┬──────────┬───────────┬────────────────────┐
│ column_0 | column_1 | column_2 | row       | rolling[mean] > 3  │
│ ---      | ---      | ---      | ---       | ---                │
│ i64      | i64      | i64      | list[i64] | list[i64]          │
╞══════════╪══════════╪══════════╪═══════════╪════════════════════╡
│ 1        | 4        | 1        | [1, 4, 1] | [null, null, null] │
├──────────┼──────────┼──────────┼───────────┼────────────────────┤
│ 2        | 5        | 8        | [2, 5, 8] | [null, 5, 8]       │
├──────────┼──────────┼──────────┼───────────┼────────────────────┤
│ 3        | 3        | 9        | [3, 3, 9] | [null, null, 9]    │
└──────────┴──────────┴──────────┴───────────┴────────────────────┘

If you want to "expand" the lists into columns:

  1. Turn the list into a struct with .arr.to_struct()
  2. .unnest() the struct.
  3. Rename the columns (if needed)
(df.with_column(pl.concat_list(pl.all()).alias("row"))
   .select(
      pl.col("row").arr.eval(
         pl.when(pl.all().rolling_mean(2) > 3)
           .then(pl.all()),
         parallel=True)
      .arr.to_struct()
      .alias("rolling[mean]"))
   .unnest("rolling[mean]")
)
shape: (3, 3)
┌─────────┬─────────┬─────────┐
│ field_0 | field_1 | field_2 │
│ ---     | ---     | ---     │
│ i64     | i64     | i64     │
╞═════════╪═════════╪═════════╡
│ null    | null    | null    │
├─────────┼─────────┼─────────┤
│ null    | 5       | 8       │
├─────────┼─────────┼─────────┤
│ null    | null    | 9       │
└─────────┴─────────┴─────────┘

.transpose()

You could always transpose the dataframe to switch the axis and use the "regular" api.

(df.transpose()
   .select(
      pl.when(pl.all().rolling_mean(2) > 3)
        .then(pl.all())
        .keep_name())
   .transpose())
shape: (3, 3)
┌──────────┬──────────┬──────────┐
│ column_0 | column_1 | column_2 │
│ ---      | ---      | ---      │
│ i64      | i64      | i64      │
╞══════════╪══════════╪══════════╡
│ null     | null     | null     │
├──────────┼──────────┼──────────┤
│ null     | 5        | 8        │
├──────────┼──────────┼──────────┤
│ null     | null     | 9        │
└──────────┴──────────┴──────────┘
jqurious
  • Thank you so much for the great answer! But tell me, is it a good idea (performance-wise) to have all those lists? Maybe I should explode them and `.groupby().agg()`? Or maybe `transpose` is going to be a better option? I'm going to test all the cases myself, but is there a rule of thumb you could recommend? – iliya malecki Dec 22 '22 at 14:42