I am using polars in place of pandas. I am quite amazed by the speed and lazy computation/evaluation. Right now, there are a lot of methods on lazy dataframe, but they can only drive me so far.
So, I am wondering what is the best way to use polars in combination with other tools to achieve more complicated operations, such as regression/model fitting.
To be more specific, I will give an example involving linear regression.
Assume I have a polars dataframe with columns day, y, x1 and x2, and I want to generate a series, which is the residual of regressing y on x1 and x2 group by day. I have included the code example as follows and how it can be solved using pandas and statsmodels. How can I get the same result with the most efficiency using idiomatic polars?
import pandas as pd
import statsmodels.api as sm
def regress_resid(df, yvar, xvars):
result = sm.OLS(df[yvar], sm.add_constant(df[xvars])).fit()
return result.resid
df = pd.DataFrame(
{
"day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
"y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3],
"x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3],
"x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1],
}
)
df.groupby("day").apply(regress_resid, "y", ["x1, "x2])
# day
# 1 0 0.772431
# 1 -0.689233
# 2 -1.167210
# 3 -0.827896
# 4 1.911909
# 2 5 -0.851691
# 6 1.719451
# 7 -1.167727
# 8 0.354871
# 9 -0.054905
Thanks for your help.