I want to compute a linear regression in a polars DataFrame, but I'm not sure which context I should use for that.

import polars as pl
from sklearn.linear_model import ElasticNetCV
from sklearn import datasets

iris = datasets.load_iris()

# name the columns and append the target as a "Species" column
dt = pl.DataFrame(iris["data"], schema=iris.feature_names)
dt = dt.with_columns(pl.Series(name="Species", values=iris.target))

X_cols = ["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]
y_col = "petal width (cm)"

regr = ElasticNetCV(cv=5, random_state=0)

res = regr.fit(dt.select(pl.col(X_cols)), dt.select(pl.col(y_col)))

I want to wrap the last line inside a polars expression, ideally in lazy mode.

So the question is: how can I use polars to run efficient computations on the DataFrame and return complex non-DataFrame objects, such as an sklearn model object?

In R, using the data.table library, this would be possible with:

dt[, lm(...)]
Klumpi
  • Does this post answer your question? https://stackoverflow.com/questions/74895640/how-to-do-regression-simple-linear-for-example-in-polars-select-or-groupby-con – dsillman2000 Apr 16 '23 at 23:01
  • Scikit-learn doesn't yet support Polars types so after transformation you need to collect the dataframe and export the necessary data to numpy using `to_numpy()` method. Scikit-learn also doesn't support lazy execution model. – NotAName Apr 16 '23 at 23:29
  • Also see discussion here: https://github.com/scikit-learn/scikit-learn/issues/25896 – NotAName Apr 16 '23 at 23:34

1 Answer

There's a pretty big difference between the way R's data.table works and how polars works. data.table can take any R function (even user-defined ones), and everything you do inside the brackets is essentially evaluated in an R environment. Additionally, data.table elements can be any type, so you can have a data.table full of lm outputs. In contrast, polars expressions are comparatively rigid. That's because everything that happens in a polars expression is "shipped" over to Rust, and the underlying data is stored in Arrow.

Another thing that makes data.table different is that when your j expression isn't a list, it doesn't return a data.table, which is exactly what makes your dt[, lm(...)] example work. polars has no equivalent: df.select(some_expr()) always returns another DataFrame, never anything else, as the snippet below demonstrates.
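
For instance, here's a minimal illustration that even a full-column reduction comes back as a 1x1 DataFrame, and you have to pull the scalar out yourself:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

out = df.select(pl.col("a").max())
print(type(out))   # <class 'polars.dataframe.frame.DataFrame'>
print(out.item())  # 3, extracted explicitly with .item()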

Doing regressions in lazy mode doesn't really give you any benefits. The idea of lazy computation is that a query optimizer can prune steps based on what the final result actually needs. For example, if you have a chain like df.sort("a").select(pl.col("a").max()).collect(), then in lazy mode it'll ignore the sort and only take the max. But if you have df.select(regr.fit(...)).select(other_stuff()).collect(), the query optimizer can't do anything to optimize the regression: it's still going to materialize all the data and hand it over to sklearn.
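
You can see what the optimizer does by inspecting the plan yourself. A minimal sketch, assuming a polars version that has LazyFrame.explain():

import polars as pl

lf = pl.LazyFrame({"a": [3, 1, 2]})
query = lf.sort("a").select(pl.col("a").max())

# compare the naive plan to the optimized one to see what got pruned
print(query.explain(optimized=False))
print(query.explain())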

From a syntax perspective though, you can get close to what you're looking for. You can extend the polars namespace so that you can do

dt.skl.regr(y_col, X_cols)

It can't be an outright expression, because the Arrow memory won't store the Python object that regr.fit returns, but it can be a DataFrame method.

The docs for `pl.api.register_dataframe_namespace` describe how this works.

@pl.api.register_dataframe_namespace("skl")
class Skl:
    def __init__(self, df: pl.DataFrame):
        self._df = df

    def regr(self, y_col, X_cols):
        # note: this relies on the global `regr` estimator defined earlier
        return regr.fit(self._df.select(pl.col(X_cols)), self._df.select(pl.col(y_col)))

Once you register that extension, all your DataFrames get that method. Of course, it depends on the global regr existing; you can instead define that object inside the custom class and make the cv and random state parameters, or whatever else you want, as sketched below.
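
Here's a minimal sketch of that self-contained variant. The `skl2` namespace name is made up to avoid clashing with the registration above, and the `to_numpy()` calls follow the comment on the question about scikit-learn not accepting polars types directly:

import polars as pl
from sklearn.linear_model import ElasticNetCV

@pl.api.register_dataframe_namespace("skl2")
class SklSelfContained:
    def __init__(self, df: pl.DataFrame):
        self._df = df

    def regr(self, y_col, X_cols, cv=5, random_state=0):
        # build the estimator here so the method has no global dependencies
        model = ElasticNetCV(cv=cv, random_state=random_state)
        return model.fit(
            self._df.select(X_cols).to_numpy(),
            self._df.select(y_col).to_numpy().ravel(),  # sklearn expects a 1D y
        )

With that in place, fitted = dt.skl2.regr(y_col, X_cols) returns the fitted ElasticNetCV object, and you can read fitted.coef_ and fitted.alpha_ off it as usual.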

Dean MacGregor