This is a follow-up to "How to return multiple stats as multiple columns in Polars grouby context?" and "How to flatten/split a tuple of arrays and calculate column means in Polars dataframe?"
The issue can be illustrated by the example below:
from functools import partial

import polars as pl
import statsmodels.api as sm


def ols_stats(s, yvar, xvars):
    df = s.struct.unnest()
    reg = sm.OLS(df[yvar].to_numpy(), df[xvars].to_numpy(), missing="drop").fit()
    return pl.Series(values=(reg.params, reg.tvalues), nan_to_null=True)


df = pl.DataFrame(
    {
        "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3],
        "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3],
        "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1],
    }
).lazy()

res = df.groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("params")
)

res.with_columns(
    pl.col("params").arr.eval(pl.element().arr.explode()).arr.to_struct()
).unnest("params").collect()
Running the code above raises the following error:
pyo3_runtime.PanicException: not implemented for dtype Unknown
But when .lazy() and .collect() are removed from the code above, it works as intended. Below is the result (the expected behavior) when running in eager mode:
shape: (2, 5)
┌─────┬──────────┬──────────┬──────────┬───────────┐
│ day ┆ field_0  ┆ field_1  ┆ field_2  ┆ field_3   │
│ --- ┆ ---      ┆ ---      ┆ ---      ┆ ---       │
│ i64 ┆ f64      ┆ f64      ┆ f64      ┆ f64       │
╞═════╪══════════╪══════════╪══════════╪═══════════╡
│ 2   ┆ 0.466089 ┆ 0.503127 ┆ 0.916982 ┆ 1.451151  │
│ 1   ┆ 1.008659 ┆ -0.03324 ┆ 3.204266 ┆ -0.124422 │
└─────┴──────────┴──────────┴──────────┴───────────┘
└─────┴──────────┴──────────┴──────────┴───────────┘
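For reference, the four fields appear to be the two coefficients (reg.params) followed by the two t-values (reg.tvalues). The sketch below is a sanity check only, not the statsmodels implementation: it assumes a no-intercept regression with exactly two regressors and solves the normal equations in plain Python. The helper name ols_two_regressors is hypothetical and is not part of Polars or statsmodels.

```python
import math


def ols_two_regressors(y, x1, x2):
    """Hypothetical helper: no-intercept OLS of y on x1 and x2.

    Returns (b1, b2, t1, t2): the coefficients followed by their t-values.
    """
    n = len(y)

    # Entries of X'X and X'y for the two-column design matrix.
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    r1 = sum(a * v for a, v in zip(x1, y))
    r2 = sum(b * v for b, v in zip(x2, y))

    # Solve the 2x2 normal equations by Cramer's rule.
    det = s11 * s22 - s12 * s12
    b1 = (s22 * r1 - s12 * r2) / det
    b2 = (s11 * r2 - s12 * r1) / det

    # Residual variance with n - 2 degrees of freedom (two parameters).
    rss = sum((v - b1 * a - b2 * b) ** 2 for v, a, b in zip(y, x1, x2))
    sigma2 = rss / (n - 2)

    # Standard errors from the diagonal of sigma2 * (X'X)^-1.
    se1 = math.sqrt(sigma2 * s22 / det)
    se2 = math.sqrt(sigma2 * s11 / det)
    return b1, b2, b1 / se1, b2 / se2


# Data for day 1, copied from the example DataFrame.
day1 = ols_two_regressors(
    y=[1, 6, 3, 2, 8], x1=[1, 8, 2, 3, 5], x2=[8, 5, 3, 6, 3]
)
print([round(v, 6) for v in day1])
```

For day 1 this agrees with the field_0 to field_3 values in the eager-mode output, so the eager result itself looks correct and the problem seems specific to lazy mode.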
So, what is the problem and how am I supposed to resolve it?