The numerical stability of Expr.mean in a GroupBy context (and of GroupBy.mean()) seems worse not only than the pandas version, but also than Expr.mean in a select context.
import numpy as np
import pandas as pd
import polars as pl
df = pd.DataFrame({'data': [10_000_000] * 1_000_000, 'group': [1, 2] * 500_000}, dtype=np.int32)
print(df.groupby('group').mean())
print(pl.from_pandas(df).groupby('group').agg(pl.col('data').mean()))
print(pl.from_pandas(df).groupby('group').mean())
"""
             data
group
1      10000000.0
2      10000000.0
shape: (2, 2)
┌───────┬─────────────┐
│ group ┆ data        │
│ ---   ┆ ---         │
│ i32   ┆ f64         │
╞═══════╪═════════════╡
│ 2     ┆ 1316.134912 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1     ┆ 1316.134912 │
└───────┴─────────────┘
shape: (2, 2)
┌───────┬─────────────┐
│ group ┆ data        │
│ ---   ┆ ---         │
│ i32   ┆ f64         │
╞═══════╪═════════════╡
│ 2     ┆ 1316.134912 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1     ┆ 1316.134912 │
└───────┴─────────────┘
"""
But Expr.mean in a select context does work:
print(df.mean())
print(pl.from_pandas(df).select(pl.col('data').mean()))
print(pl.from_pandas(df).mean())
"""
data     10000000.0
group           1.5
dtype: float64
shape: (1, 1)
┌──────┐
│ data │
│ ---  │
│ f64  │
╞══════╡
│ 1e7  │
└──────┘
shape: (1, 2)
┌──────┬───────┐
│ data ┆ group │
│ ---  ┆ ---   │
│ f64  ┆ f64   │
╞══════╪═══════╡
│ 1e7  ┆ 1.5   │
└──────┴───────┘
"""
I am on Windows with polars==0.14.18. I know that using int64 avoids the problem, but the inconsistency between the two contexts is confusing. Is this a bug?
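
For what it's worth, the 1316.134912 figure is exactly what you get if the per-group sum wraps around in signed 32-bit arithmetic before being divided by the group size. The sketch below reproduces the number in plain Python. That the GroupBy path accumulates in the column's own int32 dtype is my assumption based on the output, not something I checked in the polars source, and the helper name wrap_int32 is made up for illustration.

```python
def wrap_int32(x: int) -> int:
    """Return the value a signed 32-bit integer would hold after wraparound (hypothetical helper)."""
    return (x + 2**31) % 2**32 - 2**31


value = 10_000_000     # every entry in the 'data' column
group_size = 500_000   # rows per group (1_000_000 rows split over 2 groups)

# The exact sum is far above the int32 maximum of 2_147_483_647.
true_sum = value * group_size

# Assumption: this is what a wrapping int32 accumulator would end up holding.
wrapped_sum = wrap_int32(true_sum)

print(true_sum / group_size)     # 10000000.0
print(wrapped_sum / group_size)  # 1316.134912
```

If that assumption holds, this looks like an integer overflow in the sum rather than floating-point instability, which would also explain why int64 input gives the right answer.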