
The task is to run a multiple linear regression on several columns within a groupby context, and to return the beta coefficients and their associated t-values in separate columns.

Below is my attempt using statsmodels.

from functools import partial

import numpy as np
import polars as pl
import statsmodels.api as sm


def ols_stats(s, yvar, xvars):
    df = s.struct.unnest()
    yvar = df[yvar].to_numpy()
    xvars = df[xvars].to_numpy()
    reg = sm.OLS(yvar, sm.add_constant(xvars), missing="drop").fit()
    return np.concatenate((reg.params, reg.tvalues))



df = pl.DataFrame(
    {
        "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3],
        "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3, 1],
        "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3, 1],
        "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1, 1],
    }
)

df.groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("params")
)

The snippet above evaluates to

shape: (3, 2)
┌─────┬─────────────────────────────────────┐
│ day ┆ params                              │
│ --- ┆ ---                                 │
│ i64 ┆ object                              │
╞═════╪═════════════════════════════════════╡
│ 2   ┆ [2.0462002  0.22397054 0.3367927... │
│ 3   ┆ [0.5 0.5 0.  0. ]                   │
│ 1   ┆ [ 4.86623165  0.64029364 -0.6598... │
└─────┴─────────────────────────────────────┘

How can I split 'params' into separate columns, with one scalar value in each column?

Also, my code fails in some corner cases. Below is one of them: the only row of day 3 has a null y.

df = pl.DataFrame(
    {
        "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3],
        "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3, None],
        "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3, 1],
        "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1, 1],
    }
)

df.groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("params")
)

>>> exceptions.ComputeError: ValueError: exog is not 1d or 2d

How can I make the code robust to such cases?

Thanks for your help. And feel free to suggest your own solution.

lebesgue

2 Answers


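Note that the dtype of the params column is object, which is not what you want: it means Polars is only holding opaque Python objects (here NumPy arrays) rather than a native List column. One way to avoid that is to have ols_stats return a Python list, for example by calling .tolist() on the concatenated array.

Once params is a List column, you can use .arr.to_struct() to turn each list into a struct, which you can then .unnest() into separate columns: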

df.groupby("day").agg(
   pl.struct(["y", "x1", "x2"])
     .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
     .alias("params")
).with_columns(pl.col("params").arr.to_struct()).unnest("params")

shape: (3, 7)
┌─────┬──────────┬──────────┬───────────┬──────────┬──────────┬───────────┐
│ day ┆ field_0  ┆ field_1  ┆ field_2   ┆ field_3  ┆ field_4  ┆ field_5   │
│ --- ┆ ---      ┆ ---      ┆ ---       ┆ ---      ┆ ---      ┆ ---       │
│ i64 ┆ f64      ┆ f64      ┆ f64       ┆ f64      ┆ f64      ┆ f64       │
╞═════╪══════════╪══════════╪═══════════╪══════════╪══════════╪═══════════╡
│ 1   ┆ 4.866232 ┆ 0.640294 ┆ -0.659869 ┆ 1.547251 ┆ 1.81586  ┆ -1.430613 │
│ 2   ┆ 2.0462   ┆ 0.223971 ┆ 0.336793  ┆ 1.524834 ┆ 0.495378 ┆ 1.091109  │
│ 3   ┆ 0.5      ┆ 0.5      ┆ 0.0       ┆ 0.0      ┆ null     ┆ null      │
└─────┴──────────┴──────────┴───────────┴──────────┴──────────┴───────────┘
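The steps above do not address the corner case from the question, where a group has too few usable rows (day 3 also yields only four values instead of six in the table above). A minimal sketch of one way to handle it, assuming plain NumPy in place of statsmodels (a substitution for illustration, not the code used above): the hypothetical helper `safe_ols_stats` drops rows with missing values, always prepends an intercept, and returns a fixed-length list of NaN when the fit is not identified.

```python
import numpy as np


def safe_ols_stats(y, X):
    """Hypothetical helper: OLS with intercept, returns [betas..., t-values...].

    Always returns 2 * (n_regressors + 1) floats; all NaN when the
    regression cannot be estimated for this group.
    """
    y = np.asarray(y, dtype=float).reshape(-1)  # None becomes NaN
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n_out = 2 * (X.shape[1] + 1)

    # Drop rows with a missing value in y or in any regressor.
    keep = ~np.isnan(y) & ~np.isnan(X).any(axis=1)
    y, X = y[keep], X[keep]

    # Always add the intercept column, even if a regressor is constant.
    design = np.column_stack([np.ones(len(y)), X])
    n, k = design.shape

    # t-values need at least one residual degree of freedom and full rank.
    if n <= k or np.linalg.matrix_rank(design) < k:
        return [float("nan")] * n_out

    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - k)  # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)
    tvals = beta / np.sqrt(np.diag(cov))
    return [*beta.tolist(), *tvals.tolist()]
```

Because every group returns the same number of values, the resulting list column should convert to a struct with a consistent set of fields. Returning NaN for the coefficients as well when there are no residual degrees of freedom is a design choice of this sketch; inside ols_stats it could presumably be called as `safe_ols_stats(df[yvar].to_numpy(), df[xvars].to_numpy())`.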
jqurious
  • Thanks for the answer. Is it possible to give aliases to those unnested fields? – lebesgue Feb 21 '23 at 14:25
  • [`.struct.rename_fields()`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.struct.rename_fields.html#polars.Expr.struct.rename_fields) – jqurious Feb 21 '23 at 14:29
  • have a variation of the problem for lazy dataframe here - – lebesgue Mar 03 '23 at 22:19

To convert a List column to multiple columns, you can build one expression per list position with a list comprehension (here df is the aggregated frame that holds the params List column):

df.select([
    pl.col("params").arr.get(i).alias(f"param_{i}")
    # .max() covers the case where the list lengths differ from row to row
    for i in range(df["params"].arr.lengths().max())
])
glebcom