0

I want to get the average of a list of columns within a polars dataframe, but am getting stuck. For example:

import polars as pl

df = pl.DataFrame({
    'a':[1,2,3],
    'b':[4,5,6],
    'c':[7,8,9]
})

cols_to_mean = ['a','c']

This works:

df.select(pl.col(cols_to_mean))

It returns just those columns. But when I try to calculate the mean, this line

df.select(pl.col(cols_to_mean).mean())

returns the mean of each column, whereas I want a column of the same length whose value in each row is the mean of the selected columns for that row. There isn't an option to pass an axis to the mean function. I also tried:

df.select(pl.mean(pl.col(cols_to_mean).mean()))

But this produces an error:

TypeError: Invalid input for `col`. Expected `str` or `DataType`, got 

Is there a way to do this?

Ajeet Verma
Paul Fleming
    What should the output be? Do you want the [mean of each row](https://pola-rs.github.io/polars-book/user-guide/dsl/list_context.html) e.g. `df.select(pl.concat_list(cols_to_mean).arr.mean())`? – jqurious Apr 28 '23 at 21:40

2 Answers

2

Here is the code and the result (as also mentioned by @jqurious and @Rakesh Chaudhary):

df.select(
    col_mean = pl.concat_list(cols_to_mean).list.mean()
)

shape: (3, 1)
┌──────────┐
│ col_mean │
│ ---      │
│ f64      │
╞══════════╡
│ 4.0      │
│ 5.0      │
│ 6.0      │
└──────────┘

EDIT: use the `.list` namespace instead of `.arr`, following a Polars update.

Luca
  • Playing around with this more, I was seeing the above solution has an intermediate step where it creates a column where each element is a list. Then it takes a mean of each element in that column. I was wondering if there really isn't a more direct route like the pandas approach: ```df = pd.DataFrame({ 'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9] }) df[['a','c']].mean(axis=1)```. My best effort so far is ```df.select( col_mean = pl.Series(df.select(cols_to_mean).to_numpy().mean(axis=1)) )``` but that entails a round-trip to numpy – Paul Fleming May 01 '23 at 12:55
  • 1
    Hi @PaulFleming, the 'Polars' way of expressing axis=1 is indeed using concat_list(). Behind the scenes, concat_list() is a view on your existing data so it does not do any copies of the data. So this would be the most efficient way to do it in Polars. Copying to numpy will add a roundtrip to numpy which can slow down the query. – Luca May 01 '23 at 13:23
  • Hi @Luca, it's interesting: this solution worked in the past, but now, perhaps because of an update to Polars, when I run this code I get: ```AttributeError: 'ExprArrayNameSpace' object has no attribute 'mean'```. Have you seen this? – Paul Fleming Jun 13 '23 at 22:17
  • 1
    But I think this now gives a good result: ```df.select(cols_to_mean).mean(axis=1).alias('col_mean')``` – Paul Fleming Jun 13 '23 at 22:24
  • This version works, but only on a DataFrame, so it's a little harder to chain than the former version. – Paul Fleming Jun 13 '23 at 22:33
  • Hi @PaulFleming, you are right. Following the Polars update, the right way to do it is to use list instead of arr. – Luca Jun 15 '23 at 04:48
  • The only issue I'm having is that it was easier to embed the old method within a with_columns statement. Because the new method uses a select statement, it returns a dataframe and wants a dataframe as a starting point (I can't yet make it work with pl.columns). Maybe I'm missing a better approach; thanks a lot for your help! – Paul Fleming Jun 16 '23 at 05:15
  • Hi @PaulFleming, the new statement should also work with a `with_columns` context. The only difference should be a change in namespace: instead of calling `.arr.mean()`, we now call `.list.mean()` – Luca Jun 17 '23 at 16:30
  • 1
    `pl.concat_list(cols_to_mean).list.mean()` is faster than `.select(pl.col(cols_to_mean)).mean(axis=1)` I found – Hakase Jul 21 '23 at 04:23
-3

df = pl.DataFrame({ "col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9] })