How to do regression (simple linear for example) in polars select or groupby context?

Question

I am using polars in place of pandas. I am quite amazed by the speed and lazy computation/evaluation. Right now, there are a lot of methods on lazy dataframe, but they can only drive me so far.

So, I am wondering what is the best way to use polars in combination with other tools to achieve more complicated operations, such as regression/model fitting.

To be more specific, I will give an example involving linear regression.

Assume I have a polars dataframe with columns day, y, x1 and x2, and I want to generate a series, which is the residual of regressing y on x1 and x2 group by day. I have included the code example as follows and how it can be solved using pandas and statsmodels. How can I get the same result with the most efficiency using idiomatic polars?

import pandas as pd
import statsmodels.api as sm

def regress_resid(df, yvar, xvars):
    result = sm.OLS(df[yvar], sm.add_constant(df[xvars])).fit()
    return result.resid

df = pd.DataFrame(
    {
        "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3],
        "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3],
        "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1],
    }
)

df.groupby("day").apply(regress_resid, "y", ["x1, "x2])
# day
# 1    0    0.772431
#      1   -0.689233
#      2   -1.167210
#      3   -0.827896
#      4    1.911909
# 2    5   -0.851691
#      6    1.719451
#      7   -1.167727
#      8    0.354871
#      9   -0.054905

Thanks for your help.

Can you add an example with input data and your desired output dataframe? Maybe you can show how you would do it in pandas? — ritchie46, Dec 23 '22 at 07:34

score 6 · Accepted Answer · answered Dec 24 '22 at 09:27

If you want to pass multiple columns to a function, you have to pack them into a Struct as polars expression always map from Series -> Series.

Because polars does not use numpy memory which statsmodels does, you must convert the polars types to_numpy. This is often free in case of 1D structures.

Finally, the function should not return a numpy array, but a polars Series instead, so we convert the result.

import polars as pl
from functools import partial
import statsmodels.api as sm

def regress_resid(s: pl.Series, yvar: str, xvars: list[str]) -> pl.Series:
    df = s.struct.to_frame()
    yvar = df[yvar].to_numpy()
    xvars = df[xvars].to_numpy()
    
    result = sm.OLS(yvar, sm.add_constant(xvars)).fit()
    return pl.Series(result.resid)
    

df = pl.DataFrame(
    {
        "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3],
        "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3],
        "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1],
    }
)

(df.groupby("day")
   .agg([
       pl.struct(["y", "x1", "x2"]).apply(partial(regress_resid, yvar="y", xvars=["x1", "x2"]))
   ])
)

Thanks! Your answer is very helpful. How about the performance considerations? Does 'apply a custom function' come with some performance costs? If so, how much? How do you think if Polars implement commonly used regression functionalities natively? — lebesgue, Dec 24 '22 at 18:27
How to explode the result and align with original dataframe? — lebesgue, Jan 24 '23 at 00:44

score 4 · Answer 2 · answered Dec 23 '22 at 12:28

Since all you're asking for is simple linear regression residuals, we can do this with just a few Polars expressions:

import numpy as np
import polars as pl

# create some correlated data
generator = np.random.default_rng(seed=4)
x = generator.normal(size=100)
error = generator.normal(size=100)
df = pl.DataFrame({"x": x, "y": 2 + x + error})

def residuals(x: pl.Expr, y: pl.Expr) -> pl.Expr:
    # e_i = y_i - a - bx_i
    #     = y_i - ȳ + bx̄ - bx_i
    #     = y_i - ȳ - b(x_i - x̄)
    x_mean = x.mean()
    y_mean = y.mean()
    x_demeaned = x - x.mean()
    y_demeaned = y - y.mean()
    x_demeaned_squared = x_demeaned.pow(2)
    beta = x_demeaned.dot(y_demeaned) / x_demeaned_squared.sum()
    return y_demeaned - beta * x_demeaned

print(
    df
    .with_column(residuals(pl.col("x"), pl.col("y")).alias("e"))
    .head(3) # remove this to get the entire result
)

The first few rows look like this:

shape: (3, 3)
┌───────────┬──────────┬───────────┐
│ x         ┆ y        ┆ e         │
│ ---       ┆ ---      ┆ ---       │
│ f64       ┆ f64      ┆ f64       │
╞═══════════╪══════════╪═══════════╡
│ -0.651791 ┆ 1.257485 ┆ -0.146135 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ -0.174717 ┆ 1.704334 ┆ -0.260438 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.663724  ┆ 4.740446 ┆ 0.613234  │
└───────────┴──────────┴───────────┘

Thanks for your answer. Yes I am asking for simple linear regression just for illustration purposes, and your answer works well for it. But what I am really asking is the framework/paradigm of using polars with other tools/packages (such as statsmodels, again, as an example just for illustration purposes). So, your solution may not extend easily if I want to switch to other models. It would be great if the answer is presented with a general extensible framework/paradigm and use simple linear regression model just as a concrete example to showcase it. Thanks! — lebesgue, Dec 23 '22 at 14:06
Is there possible to implement OLS with potentially multiple independent variables just by using polars expressions? — lebesgue, Jan 30 '23 at 18:06

How to do regression (simple linear for example) in polars select or groupby context?

2 Answers2

Linked