I'm unable to use Polars DataFrames with scikit-learn for ML training.

Currently I do all the DataFrame preprocessing in Polars, and at model-training time I convert the result to a pandas DataFrame so that it works.

Is there a way to use a Polars DataFrame directly for ML training, without converting it to pandas?

desertnaut
RKCH

2 Answers

You must call to_numpy when passing a DataFrame to sklearn. Although sklearn can sometimes work on polars Series, it is still good type hygiene to convert to the type the host library expects.

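import numpy as np
import polars as pl
from sklearn.linear_model import LinearRegression

# 100 rows, 5 columns; polars names them column_0 ... column_4
data = pl.DataFrame(
    np.random.randn(100, 5)
)

# Features: every column except column_0
x = data.select([
    pl.all().exclude("column_0"),
])

# Target: column_0, renamed to "y"
y = data.select(pl.col("column_0").alias("y"))

# Simple 80/20 split by row position
x_train = x[:80]
y_train = y[:80]

x_test = x[80:]
y_test = y[80:]

m = LinearRegression()

# Convert to NumPy at the sklearn boundary
m.fit(X=x_train.to_numpy(), y=y_train.to_numpy())
m.predict(x_test.to_numpy())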
ritchie46
  • Thank you. One more thing: what should I do when encoding categorical features with sklearn's one-hot encoder or any other technique? The features are not getting encoded. – RKCH Nov 11 '22 at 12:47
  • Also, if I convert x_train or x_test to multidimensional NumPy arrays, is more memory used (some for the Polars DataFrame plus some for the NumPy array)? If it takes extra memory, that will be a problem on a system with little RAM. – RKCH Nov 11 '22 at 13:01
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer, OneHotEncoder

# xtrain and xtest are polars DataFrames
encoding_transformer1 = ColumnTransformer(
    [("Normalizer", Normalizer(), ['Age', 'Fare']),
     ("One-hot encoder",
      OneHotEncoder(dtype=int, handle_unknown='infrequent_if_exist'),
      ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked'])],
    n_jobs=-1,
    verbose=True,
    verbose_feature_names_out=True)

encoding_transformer1.fit(xtrain)
train_data = encoding_transformer1.transform(xtrain).tocsr()
test_data = encoding_transformer1.transform(xtest).tocsr()

I'm getting this error:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

What should I do?

RKCH
  • I found that one-hot encoding can be done with the to_dummies method, but how can other encoding methods be done? – RKCH Nov 11 '22 at 13:34