When calling model = <your ml-algorithm>.fit(df_train), the training dataset can contain any number of additional columns. Only the feature and label columns are used to train the model (they are usually called features and label, but the names are configurable); any other columns can be present and are simply ignored.
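For example, with LogisticRegression the column names can be changed via the featuresCol and labelCol parameters (just a sketch; my_features and target are hypothetical column names in df_train):
from pyspark.ml.classification import LogisticRegression

# Train on custom column names instead of the defaults "features" and "label".
# "my_features" and "target" are hypothetical columns in df_train.
lr = LogisticRegression(featuresCol="my_features", labelCol="target")
model = lr.fit(df_train)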
When calling predictions = model.transform(df_test) on the trained model in the next step, a dataframe is returned that has the additional columns prediction, probability and rawPrediction. In particular, the original feature column and the label column are still part of the dataframe. Furthermore, any column that was part of df_test is still available in the output and can be used to identify the row.
prediction = model.transform(df_test)
prediction.printSchema()
prints
root
|-- feature1: double (nullable = true)
|-- feature2: double (nullable = true)
|-- feature3: double (nullable = true)
|-- label: double (nullable = true)
|-- additional_data: string (nullable = true)
|-- features: vector (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = false)
if df_test contains not only the required column features but also the other columns, including label. By comparing label and prediction one could now, for example, create BinaryClassificationMetrics.
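A minimal sketch with the RDD-based BinaryClassificationMetrics (it expects an RDD of (score, label) pairs; the hard prediction is used as the score here, the positive-class probability would work as well):
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Map the prediction dataframe to (score, label) tuples
score_and_labels = prediction.select("prediction", "label").rdd.map(tuple)

metrics = BinaryClassificationMetrics(score_and_labels)
print(metrics.areaUnderROC)
print(metrics.areaUnderPR)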
Calling model.transform is technically a Dataset.withColumn call.
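Since transform only appends columns and never removes any, the difference between the output and input columns is exactly the set of new prediction columns (a small check, reusing prediction and df_test from above):
# transform appends columns to df_test; nothing is dropped
print(set(prediction.columns) - set(df_test.columns))
# {'rawPrediction', 'probability', 'prediction'}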
An example based on the ML Pipeline example from the Spark docs: the Spark ML workflow usually starts with a dataframe containing the training data, i.e. features and labels (= target values). In this example, there is also an additional column present that is irrelevant for the ML process.
training_original = spark.createDataFrame([
    (0.0, 1.1, 0.1, 1.0, 'any random value that is not used to train the model'),
    (2.0, 1.0, -1.0, 0.0, 'another value'),
    (2.0, 1.3, 1.0, 0.0, 'value 3'),
    (0.0, 1.2, -0.5, 1.0, 'this value is also not used for training nor testing')],
    ["feature1", "feature2", "feature3", "label", "additional_data"])
Then a transformer is used to combine the features into a single column. The easiest transformer for this task is a VectorAssembler
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "feature3"],
    outputCol="features")
training_transformed = assembler.transform(training_original)
#+--------+--------+--------+-----+--------------------+--------------+
#|feature1|feature2|feature3|label| additional_data| features|
#+--------+--------+--------+-----+--------------------+--------------+
#| 0.0| 1.1| 0.1| 1.0|any random value ...| [0.0,1.1,0.1]|
#| ...
The model can now be trained on this dataframe, using the columns features and label. The additional columns are present but will be ignored by the fit method.
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training_transformed)
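As a quick sanity check (a sketch, assuming the binary LogisticRegressionModel from above), the fitted model only carries weights for the assembled feature vector; additional_data played no role:
# One weight per entry of the "features" vector
print(model.coefficients)
print(model.intercept)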
Now the model is tested against the test data. The preparation is the same as for the training data:
test_df = spark.createDataFrame([
    (-1.0, 1.5, 1.3, 1.0, 'test value 1'),
    (3.0, 2.0, -0.1, 0.0, 'another test value'),
    (0.0, 2.2, -1.5, 1.0, 'this is not important')],
    ["feature1", "feature2", "feature3", "label", "additional_data"])
test_df_transformed = assembler.transform(test_df)
#+--------+--------+--------+-----+--------------------+--------------+
#|feature1|feature2|feature3|label| additional_data| features|
#+--------+--------+--------+-----+--------------------+--------------+
#| -1.0| 1.5| 1.3| 1.0| test value 1|[-1.0,1.5,1.3]|
#| ...
Running the ML magic produces
prediction = model.transform(test_df_transformed)
#+--------+--------+--------+-----+--------------------+--------------+--------------------+--------------------+----------+
#|feature1|feature2|feature3|label| additional_data| features| rawPrediction| probability|prediction|
#+--------+--------+--------+-----+--------------------+--------------+--------------------+--------------------+----------+
#| -1.0| 1.5| 1.3| 1.0| test value 1|[-1.0,1.5,1.3]|[-6.5872014439355...|[0.00137599470692...| 1.0|
#| ...
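Because additional_data is passed through unchanged, it can be used to map each prediction back to its original row, e.g.:
# The pass-through column identifies which row a prediction belongs to
prediction.select("additional_data", "label", "prediction").show(truncate=False)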
This dataframe now contains the original input data (feature1 to feature3 and additional_data), the expected target values (label), the transformed features (features) and the result predicted by the model (prediction). All input values, target values and predictions are available in one dataset, so this is the place to evaluate the model and calculate the desired metrics. Applying the model to new data would give the same result (but without the label column, of course).
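For completeness, a sketch of applying the model to genuinely new data without a label column (the rows here are made up):
new_df = spark.createDataFrame([
    (1.0, 0.5, -0.3, 'new record 1'),
    (-2.0, 1.8, 0.7, 'new record 2')],
    ["feature1", "feature2", "feature3", "additional_data"])

# Same preparation as before, just without a label column
new_predictions = model.transform(assembler.transform(new_df))
new_predictions.select("additional_data", "prediction").show(truncate=False)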