
Using PySpark's ML module, the following steps often occur (after data cleaning, etc.):

  1. Run the feature and target transformation pipeline
  2. Create the model
  3. Generate predictions from the model
  4. Merge the predictions and the original dataset together, for business users and for model-validation purposes

Taking a boiled-down snippet of code:

predictions = model.transform(test_df)

This predictions dataframe will only have the predictions (plus the probabilities and possibly a transformation of the predictions), but it will not contain the original dataset.

How Can I Combine Predictions with Original PySpark DataFrame?

It is not obvious to me how I can combine that original dataset (or even the transformed test_df) and the predictions; there is no shared column to join on, and adding an index column seems quite tricky for large datasets.

Current Solution:

For large datasets, like what I am working with, I have tried the suggestion here:

from pyspark.sql.types import StructType

# force both dataframes onto the same number of partitions before zipping
test_df = test_df.repartition(predictions.rdd.getNumPartitions())
joined_schema = StructType(test_df.schema.fields + predictions.schema.fields)
# zip the rows pairwise and concatenate each pair into a single row
interim_rdd = test_df.rdd.zip(predictions.rdd).map(lambda x: x[0] + x[1])
full_data = spark.createDataFrame(interim_rdd, joined_schema)
full_data.write.parquet(my_predictions_path, mode="overwrite")


But I don't like this for 2 reasons:

  1. I am not completely certain that order is maintained. The link suggests that it should be, but I do not understand why.
  2. It sometimes crashes, even though I am forcing a repartitioning as shown above, with the following error when I try to write the data via that last line above:

Caused by: org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition


I do not want to use the monotonically_increasing_id suggestion sometimes given because my dataset is too large to allow for this.


It seems so fundamental: how can I report any model quality without being able to compare predictions with the original targets? How do others do this?

Mike Williamson

1 Answer


When calling model = <your ml-algorithm>.fit(df_train), the training dataset can have any number of additional columns. Only the columns that contain the features and the labels will be used for training the model (usually called features and label; the names are configurable), but additional columns can be present.

When calling predictions = model.transform(df_test) on the trained model in the next step, a dataframe is returned that has the additional columns prediction, probability and rawPrediction.

In particular, the original feature columns and the label column are still part of the dataframe. Furthermore, any column that was part of df_test is still available in the output and can be used to identify a row.

prediction = model.transform(df_test)
prediction.printSchema()

prints

root
 |-- feature1: double (nullable = true)
 |-- feature2: double (nullable = true)
 |-- feature3: double (nullable = true)
 |-- label: double (nullable = true)
 |-- additional_data: string (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

provided that df_test contains not only the required features column but also the other columns, including label. By comparing label and prediction one could now, for example, compute BinaryClassificationMetrics.

Technically, calling model.transform boils down to a Dataset.withColumn call.


An example based on the ML Pipeline example from the Spark docs: the Spark ML workflow usually starts with a dataframe containing the training data, features and labels (= target values). In this example, there is also an additional column present that is irrelevant for the ML process.

training_original = spark.createDataFrame([
    (0.0, 1.1, 0.1, 1.0, 'any random value that is not used to train the model'),
    (2.0, 1.0, -1.0, 0.0, 'another value'),
    (2.0, 1.3, 1.0, 0.0, 'value 3'),
    (0.0, 1.2, -0.5, 1.0, 'this value is also not used for training nor testing')],  
    ["feature1", "feature2", "feature3", "label", "additional_data"])

Then a transformer is used to combine the features into a single column. The easiest transformer for this task is a VectorAssembler.

from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "feature3"],
    outputCol="features")
training_transformed = assembler.transform(training_original)
#+--------+--------+--------+-----+--------------------+--------------+          
#|feature1|feature2|feature3|label|     additional_data|      features|
#+--------+--------+--------+-----+--------------------+--------------+
#|     0.0|     1.1|     0.1|  1.0|any random value ...| [0.0,1.1,0.1]|
#| ...

The model can now be trained on this dataframe, using the columns features and label. The additional columns are present but will be ignored by the fit method.

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training_transformed)

Now the model is tested against the test data. The preparation is the same as for the training data:

test_df = spark.createDataFrame([
    (-1.0, 1.5, 1.3, 1.0, 'test value 1'),
    (3.0, 2.0, -0.1, 0.0, 'another test value'),
    (0.0, 2.2, -1.5, 1.0, 'this is not important')],
    ["feature1", "feature2", "feature3", "label", "additional_data"])
test_df_transformed = assembler.transform(test_df)
#+--------+--------+--------+-----+--------------------+--------------+
#|feature1|feature2|feature3|label|     additional_data|      features|
#+--------+--------+--------+-----+--------------------+--------------+
#|    -1.0|     1.5|     1.3|  1.0|        test value 1|[-1.0,1.5,1.3]|
#| ...

Running the ML magic produces

prediction = model.transform(test_df_transformed)
#+--------+--------+--------+-----+--------------------+--------------+--------------------+--------------------+----------+
#|feature1|feature2|feature3|label|     additional_data|      features|       rawPrediction|         probability|prediction|
#+--------+--------+--------+-----+--------------------+--------------+--------------------+--------------------+----------+
#|    -1.0|     1.5|     1.3|  1.0|        test value 1|[-1.0,1.5,1.3]|[-6.5872014439355...|[0.00137599470692...|       1.0|
#| ...

This dataframe now contains the original input data (feature1 to feature3 and additional_data), the expected target values (label), the transformed features (features) and the result predicted by the model (prediction). All input values, target values and predictions are available in one dataset, so this is the place to evaluate the model and calculate the desired metrics. Applying the model to new data would give the same result (but without the label column, of course).

werner
  • Hi @werner, I guess I was not clear, because you just proved my challenge. The DF of input values can only contain the features. My original dataset, *before* transforming it to satisfy the needs of Spark MLLib `model.fit`, **also contained the target values**. So now I need to combine the original dataset with the target values with the dataset with predictions. With any decent-sized dataset, I cannot find a way to join datasets. The best suggestion I have had so far is to [push to CSV and do a CLI `paste`](https://stackoverflow.com/q/63727512/534238) command, which seems crazy. – Mike Williamson Sep 08 '20 at 06:50
  • 1
    @MikeWilliamson additional columns are no problems for `model.fit` and `model.transform`. The input datasets are not restricted to the features and labels, they can contain whatever columns are there. The idea of the Spark ML workflow is to continously add new columns to the existing dataset without dropping columns on the way. I have added some more details to the example, I hope it is better understandable now what I meant. – werner Sep 08 '20 at 18:57
  • Of course. I was being dumb and thought I could **only** provide a DF that contained **only** the features. I got it to work with your suggestion of keeping the additional columns. For reasons that I still do not understand, it took **forever** to write the predictions to disk (many hours), whereas it was only a few minutes to write the input dataset to disk. (Same partitioning scheme.) Thanks so much! – Mike Williamson Sep 11 '20 at 10:37