I am running a logistic regression in PySpark with Spark 2.1.2.

I know it is possible to save a regression model as follows:

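from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol='features',
                        labelCol='is_clickout',
                        regParam=0,
                        fitIntercept=False,
                        family="binomial")

# assumed pipeline: the estimator is the last stage (any earlier stages omitted)
pipeline = Pipeline(stages=[lr])
model = pipeline.fit(data)  # `data` is the training DataFrame

# save model for future use
save_path = "model_0"
model.save(save_path)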

The problem is that the saved model does not include the summary:

from pyspark.ml import PipelineModel
# the saved object is a fitted pipeline, so load it as a PipelineModel
model2 = PipelineModel.load(save_path)
model2.stages[-1].hasSummary  # returns False

I can extract the summary as follows, but it has no save method attached to it:

# Get the model summary
summary = model.stages[-1].summary

Is there a quick way to save the summary object, ideally one that also works for multiple regressions?

Currently, I read all the object's attributes and save them as a pandas DataFrame df.

Mario
hamiq

1 Answer


Unfortunately, your observation is correct. I had the same problem with Spark 2.4.3 and I've found this comment confirming the issue:

For LinearRegressionModel, this does NOT currently save the training summary. An option to save summary may be added in the future.

The same comment is still there in Spark 3.0.0-rc1 (the latest tag in the Spark repository at the time of writing).

If we want to persist the summary, we need to serialize it somehow ourselves. I've done this before by extracting the statistics I wanted and saving them in a JSON document just after training my model.
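As a rough illustration of that approach, here is a minimal sketch rather than the exact code I used. The helper names `summary_to_dict` and `save_summary` are hypothetical, and the choice of statistics is an assumption: it only pulls a few attributes that a binary logistic regression training summary exposes (`areaUnderROC`, `totalIterations`, `objectiveHistory`, and the `roc` DataFrame with its `FPR` and `TPR` columns). Adjust the list to whatever you actually need.

```python
import json


def summary_to_dict(summary):
    """Copy a few statistics out of a training summary into plain Python types.

    Assumption: `summary` is a binary logistic regression training summary.
    """
    return {
        "areaUnderROC": float(summary.areaUnderROC),
        "totalIterations": int(summary.totalIterations),
        "objectiveHistory": [float(v) for v in summary.objectiveHistory],
        # `roc` is a Spark DataFrame; collect() brings its rows to the driver
        "roc": [[float(r["FPR"]), float(r["TPR"])] for r in summary.roc.collect()],
    }


def save_summary(summary, path):
    """Write the extracted statistics to a JSON file on the driver's local disk."""
    with open(path, "w") as f:
        json.dump(summary_to_dict(summary), f, indent=2)
```

With the fitted pipeline from the question, this would be called right after training, for example `save_summary(model.stages[-1].summary, "model_0_summary.json")` (the file name is just an example). For several regressions, one option is to call it once per model with a different path, and later read the files back with `json.load`.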

boechat107