Numerical stability of Expr.mean in GroupBy context

Question

The numerical stability of Expr.mean in a GroupBy context (GroupBy.mean()) seems worse not only than the pandas version, but also than Expr.mean in a select context.

import numpy as np
import pandas as pd
import polars as pl

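# One million int32 rows, all equal to 10_000_000, split evenly into two groups.
df = pd.DataFrame({'data': [10_000_000] * 1_000_000, 'group': [1, 2] * 500_000}, dtype=np.int32)
print(df.groupby('group').mean())
# Both polars group-by variants below print the wrong mean.
print(pl.from_pandas(df).groupby('group').agg(pl.col('data').mean()))
print(pl.from_pandas(df).groupby('group').mean())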

"""
             data
group
1      10000000.0
2      10000000.0

shape: (2, 2)
┌───────┬─────────────┐
│ group ┆ data        │
│ ---   ┆ ---         │
│ i32   ┆ f64         │
╞═══════╪═════════════╡
│ 2     ┆ 1316.134912 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1     ┆ 1316.134912 │
└───────┴─────────────┘

shape: (2, 2)
┌───────┬─────────────┐
│ group ┆ data        │
│ ---   ┆ ---         │
│ i32   ┆ f64         │
╞═══════╪═════════════╡
│ 2     ┆ 1316.134912 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1     ┆ 1316.134912 │
└───────┴─────────────┘

"""

But Expr.mean in a select context works:

print(df.mean())
print(pl.from_pandas(df).select(pl.col('data').mean()))
print(pl.from_pandas(df).mean())

"""
data     10000000.0
group           1.5
dtype: float64

shape: (1, 1)
┌──────┐
│ data │
│ ---  │
│ f64  │
╞══════╡
│ 1e7  │
└──────┘

shape: (1, 2)
┌──────┬───────┐
│ data ┆ group │
│ ---  ┆ ---   │
│ f64  ┆ f64   │
╞══════╪═══════╡
│ 1e7  ┆ 1.5   │
└──────┴───────┘
"""

I am on Windows with polars==0.14.18. I know that using int64 is the solution, but the behavior is a little confusing. Is it a bug?

That is an overflow issue. I have opened an issue to fix this: https://github.com/pola-rs/polars/issues/5194 - ritchie46, Oct 13 '22 at 14:19
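
The overflow explanation can be checked by hand. The sketch below assumes (this is an assumption, not something confirmed from the polars source) that the group-by sum is accumulated in wrapping 32-bit integer arithmetic and then divided by the group size. The helper name wrap_int32 is hypothetical and only illustrates two's-complement wraparound in plain Python.

```python
def wrap_int32(value: int) -> int:
    """Reinterpret an integer as a wrapped two's-complement 32-bit value.

    Hypothetical helper for illustration; it is not part of polars or pandas.
    """
    return (value + 2**31) % 2**32 - 2**31


rows_per_group = 500_000   # each of the two groups holds half of the 1_000_000 rows
value = 10_000_000         # every entry in the 'data' column

true_sum = rows_per_group * value      # exceeds the int32 maximum of 2_147_483_647
wrapped_sum = wrap_int32(true_sum)     # what a wrapping int32 accumulator would hold
wrapped_mean = wrapped_sum / rows_per_group

print(true_sum)       # 5000000000000
print(wrapped_sum)    # 658067456
print(wrapped_mean)   # 1316.134912
```

Under that assumption the wrapped mean matches the 1316.134912 printed in the group-by output above. Casting the column to a 64-bit integer before aggregating, for example with pl.col('data').cast(pl.Int64).mean(), keeps the sum within range, which is the int64 workaround mentioned in the question.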
