There's a pretty big difference between how R's data.table works and how polars works. data.table can take any R function equally well (even user-defined ones), and everything you do inside the brackets is basically evaluated like an R environment. Additionally, data.table elements can be any type of element; that is to say, you can have a data.table full of lm outputs. In contrast, polars expressions are very rigid, at least by comparison. That's because everything that happens in a polars expression is "shipped" over to Rust, and the underlying data is stored in Arrow.
Another thing that makes data.table different is that when your j expression isn't a list, it doesn't return a data.table. That's why dt[, lm(...)] in your example can hand back the model object itself. polars doesn't have that: you can never do df.select(some_expr()) and get anything but another dataframe.
Doing regressions in lazy mode doesn't really give you any benefits. The idea of lazy computation is that there's a query optimizer which can prune steps that don't affect the final result. For example, if you have a chain like df.sort("a").select(pl.col('a').max()).collect(), then in lazy mode it'll ignore the sort and only take the max. If you have df.select(regr.fit(...)).select(other_stuff()).collect(), then the query optimizer can't do anything to optimize the regression. It's still going to materialize all the data and hand it over to sklearn.
From a syntax perspective, though, you can get close to what you're looking for. You can extend the polars namespace so that you can do
dt.skl.regr(y_col, X_cols)
It can't be an outright expression, because Arrow memory won't store the Python object that regr.fit returns, but it can be a dataframe method.
See the polars docs on extending the API (pl.api.register_dataframe_namespace).
import polars as pl

@pl.api.register_dataframe_namespace("skl")
class skl:
    def __init__(self, df: pl.DataFrame):
        self._df = df

    def regr(self, y_col, X_cols):
        # regr is a global sklearn estimator that must be defined elsewhere
        return regr.fit(self._df.select(pl.col(X_cols)), self._df.select(pl.col(y_col)))
Once you register that extension, all your dataframes get that method. Of course, it depends on the global regr existing. You could instead define that object inside the custom class and make the cv and random state parameters, or whatever you want to do.