Does Polars have an API for least squares linear regression?

I can't find one in the Polars API Reference.

If not, how can I perform least squares linear regression efficiently using only the Polars library?

import polars as pl

data = {
    'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'y': [22.0, 33.9, 44.8, 78.9, 44.3, 20.5, 30.5, 56.4, 92.3, 22.1, 88, 10.1]
}

df = pl.DataFrame(data)
  • I don't think it does. https://stackoverflow.com/a/74899658 may be of interest. – jqurious May 10 '23 at 13:41
  • To do OLS in *just* polars, no other library, you'd have to figure out matrix multiplication using joins which probably isn't that bad. You'd then have to figure out matrix inversion in expressions which would be pretty challenging. Once you get your Beta estimators, if you want t-statistics and the associated p-values, you'll have to figure out some calculus *in polars*. Just use statsmodels or sklearn with df.to_numpy() to avoid pandas. – Dean MacGregor May 10 '23 at 14:54

1 Answer

You can run a least squares regression with a mix of Polars and NumPy.

However, since Polars is not a statistics or modeling library, I think it makes more sense to use a dedicated library such as scikit-learn for this.

Here is an example of running a linear regression using Polars and NumPy:

import polars as pl
import numpy as np

# Create a sample dataset
data = {
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 4, 6, 8, 12],
    'Y':  [2, 4, 5, 4, 5]
}
df = pl.DataFrame(data)

# Build the design matrix: the two predictors plus a column of ones for the intercept
X = df.select(
    'X1', 'X2',
    ones = pl.lit(1)
).to_numpy()
Y = df['Y'].to_numpy()

# Solve the normal equations: theta = (X^T X)^-1 X^T Y
X_transpose = X.T
X_transpose_dot_X = np.dot(X_transpose, X)
X_transpose_dot_X_inv = np.linalg.inv(X_transpose_dot_X)
X_transpose_dot_Y = np.dot(X_transpose, Y)
theta = np.dot(X_transpose_dot_X_inv, X_transpose_dot_Y)

# Attach the fitted values as a new column
df = df.with_columns(
    Y_pred = pl.Series(np.dot(X, theta))
)

print(df)
print(f"intercept: {theta[-1]}")
print(f"coef_x1: {theta[0]}")
print(f"coef_x2: {theta[1]}")

┌─────┬─────┬─────┬────────┐
│ X1  ┆ X2  ┆ Y   ┆ Y_pred │
│ --- ┆ --- ┆ --- ┆ ---    │
│ i64 ┆ i64 ┆ i64 ┆ f64    │
╞═════╪═════╪═════╪════════╡
│ 1   ┆ 2   ┆ 2   ┆ 2.7    │
│ 2   ┆ 4   ┆ 4   ┆ 3.4    │
│ 3   ┆ 6   ┆ 5   ┆ 4.1    │
│ 4   ┆ 8   ┆ 4   ┆ 4.8    │
│ 5   ┆ 12  ┆ 5   ┆ 5.0    │
└─────┴─────┴─────┴────────┘
intercept: 1.9999999999999947
coef_x1: 1.2000000000000357
coef_x2: -0.25000000000000533
Luca
  • I tried to do matrix multiplication with just polars expressions for a couple minutes, then I reminded myself about matrix inversion and said well I'm not doing that so I quit. I'm not sure why OP says they want to use nothing but polars but if they're going to relax that constraint for np they might as well do it for statsmodels or sklearn. That being said, I like the manual approach you did. – Dean MacGregor May 10 '23 at 14:57
  • @DeanMacGregor: yes, I also looked into doing matrix inversion in Polars, but it would be a lot of code and might perform worse than NumPy or sklearn. I am proposing NumPy because the mathematical operations are transparent (it's easy to see what's happening), while sklearn and statsmodels are more of a black box. In a production environment I would use sklearn or statsmodels, while NumPy might be good for getting some mathematical understanding of what is going on. Anyway, yes, I am not sure why the question limits the possible libraries to Polars. – Luca May 10 '23 at 15:00
  • I have no doubt that it would perform worse, much worse. – Dean MacGregor May 10 '23 at 15:04