
I am running a linear regression using Spark Pipelines in pyspark. Once the linear regression model is trained, how do I get the coefficients out?

Here is my pipeline code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Get all of our features together into one array called "features".  Do not include the label!
feature_assembler = VectorAssembler(inputCols=get_column_names(df_train), outputCol="features")

# Define our model
lr = LinearRegression(maxIter=100, elasticNetParam=0.80, labelCol="label", featuresCol="features",
                      predictionCol="prediction")

# Define our pipeline
pipeline_baseline = Pipeline(stages=[feature_assembler, lr])

# Train our model using the training data
model_baseline = pipeline_baseline.fit(df_train)

# Use our trained model to make predictions using the validation data
output_baseline = model_baseline.transform(df_val)  #.select("features", "label", "prediction", "coefficients")
predictions_baseline = output_baseline.select("label", "prediction")

I have tried using methods from the PipelineModel class. Here are my attempts to get the coefficients, but I only get an empty list and an empty dictionary:

params = model_baseline.stages[1].params
print('Try 1 - Parameters: %s' % (params,))
params = model_baseline.stages[1].extractParamMap()
print('Try 2 - Parameters: %s' % (params,))

Out[]:
Try 1 - Parameters: []
Try 2 - Parameters: {}

Are there methods for PipelineModel that return the trained coefficients?

M. Oneto

1 Answer


You are looking at the wrong property. params can be used to extract Estimator or Transformer Params, such as input or output columns (see the ML Pipeline parameters docs), not estimated values.

For LinearRegressionModel use coefficients:

model.stages[-1].coefficients
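
To make the coefficients easier to read, it can help to pair each one with the input column it belongs to. The following is only a minimal sketch: `named_coefficients` is a hypothetical helper (not part of pyspark), it assumes the regression model is the last stage, and it assumes `feature_names` is in the same order as the VectorAssembler's `inputCols`. A stand-in object replaces the fitted PipelineModel so the snippet runs without Spark. With a real model, `coefficients` is a Spark vector (its `toArray()` method gives a NumPy array) and `intercept` is a float.

```python
from types import SimpleNamespace


def named_coefficients(pipeline_model, feature_names):
    """Pair each input column name with its fitted coefficient.

    Hypothetical helper. Assumes the regression model is the last stage of
    the fitted pipeline and that feature_names follows the same order as
    the VectorAssembler's inputCols.
    """
    lr_model = pipeline_model.stages[-1]
    # `coefficients` holds the fitted weights; `intercept` is a single number.
    weights = [float(w) for w in lr_model.coefficients]
    if len(weights) != len(feature_names):
        raise ValueError("expected one feature name per coefficient")
    return dict(zip(feature_names, weights)), float(lr_model.intercept)


# Stand-in for a fitted PipelineModel (assumption: made-up values, no Spark).
fake_model = SimpleNamespace(
    stages=[
        SimpleNamespace(),  # placeholder for the VectorAssembler stage
        SimpleNamespace(coefficients=[0.5, -1.25], intercept=2.0),
    ]
)

coefs, intercept = named_coefficients(fake_model, ["x1", "x2"])
print(coefs, intercept)
```

With the pipeline from the question, the call would presumably look like `named_coefficients(model_baseline, get_column_names(df_train))`.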
Christian Alis
zero323
  • Perfect! Thank you. This is exactly what I was looking for. Do you also know how I can get the hyper-parameter values out (e.g. regParam or elasticNetParam)? This is a new application. I am running a [CrossValidator](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html?highlight=crossvalidator#pyspark.ml.tuning.CrossValidator) instance to try different hyper-parameters. Once the best model is found, I want to know which hyper-parameters are used by the best model. `model.bestModel.stages[-1].coefficients` gets me the coefficients of the best linear regression model. – M. Oneto Aug 04 '16 at 21:34
  • That doesn't work for me. I get: `TypeError: 'Param' object does not support indexing` – Chuck Mar 12 '20 at 15:21
  • If you are looking at the pipeline, use `pipeline.getStages()[-1]`; if you are dealing with a fitted model, use `model.stages[-1]`. See https://stackoverflow.com/questions/38664620/any-way-to-access-methods-from-individual-stages-in-pyspark-pipelinemodel – Chuck Mar 12 '20 at 17:12
  • Is there any way to do this without relying on indexing? What if the model you want coefficients for isn't in the [-1] position (maybe you want to put an evaluator after it)? – justin cress May 20 '20 at 15:48
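
On the last comment: one way to avoid positional indexing is to search the fitted stages by type. This is only a sketch, and `find_stage` is a hypothetical helper, not a pyspark function. The stand-in classes below exist so the snippet runs without Spark. With pyspark you would pass the fitted PipelineModel and `pyspark.ml.regression.LinearRegressionModel` instead.

```python
def find_stage(pipeline_model, stage_type):
    """Return the first fitted stage that is an instance of stage_type."""
    for stage in pipeline_model.stages:
        if isinstance(stage, stage_type):
            return stage
    raise LookupError("no stage of type %s" % stage_type.__name__)


# Stand-in classes (assumptions for illustration only, not Spark classes).
class FakeAssembler(object):
    pass


class FakeRegressionModel(object):
    coefficients = [0.5, -1.25]


class FakePipelineModel(object):
    stages = [FakeAssembler(), FakeRegressionModel()]


lr_model = find_stage(FakePipelineModel(), FakeRegressionModel)
print(lr_model.coefficients)
```

The lookup returns the first match, so a pipeline with several stages of the same type would need a more specific test, for example comparing each stage's `uid`.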