
I have trained a classification model using pyspark.ml.classification.RandomForestClassifier and applied it to a new dataset for prediction. I remove the customer_id column before feeding the dataset to the model, but I am not sure how to map each customer_id back after prediction. Since Spark dataframes are inherently unordered, there is no way for me to identify which row belongs to which customer.
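
A minimal sketch of the situation (dataframe and column names are hypothetical):

# the id is dropped up front, so the prediction output has no key
# left to join back to the customers
scoring_df = new_customers.drop("customer_id")
predictions = rf_model.transform(scoring_df)  # which row is which customer?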

  • why are you removing the id col in the first place? – thePurplePython Sep 16 '19 at 18:20
  • because the id column doesn't add any value to the model (as per my knowledge, and I am new to the DS realm) – Mrinal Sep 16 '19 at 18:24
  • In contrast to most other frameworks, Pyspark ML algos need the features assembled under a single column, referenced as `featuresCol` in the arguments; normally, you should have already done this using `VectorAssembler` (as shown in Step 1 [here](https://stackoverflow.com/questions/47585723/kmeans-clustering-in-pyspark/47593712#47593712)) before fitting your RF model, without including the ID column, which stays as-is (it doesn't affect the model). If not, please *show some details*... – desertnaut Sep 16 '19 at 19:40
  • Looks like I kind of answered my own question unknowingly. Thank you, guys! @thePurplePython, if you could write an answer, I could accept it and close this question. – Mrinal Sep 17 '19 at 02:30

1 Answer


Here is a nice Spark docs example of classification using the pipeline technique, where the original schema is preserved and only the selected columns are used as input features to the learning algorithm (I swapped the original estimator for a random forest).

reference => https://spark.apache.org/docs/latest/ml-pipeline.html

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import HashingTF, Tokenizer

# get (or create) the SparkSession the examples below rely on
spark = SparkSession.builder.getOrCreate()

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and rf.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, rf])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)

# schema is preserved
prediction.printSchema()

root
 |-- id: long (nullable = true)
 |-- text: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

# sample row
for i in prediction.take(1): print(i)

Row(id=4, text='spark i j k', words=['spark', 'i', 'j', 'k'], features=SparseVector(262144, {20197: 1.0, 24417: 1.0, 227520: 1.0, 234657: 1.0}), rawPrediction=DenseVector([5.0857, 4.9143]), probability=DenseVector([0.5086, 0.4914]), prediction=0.0)
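
Because the id column is carried through transform(), the predictions map straight back to their rows; just select the id alongside the prediction:

# no join needed: the id rides along with its row
prediction.select("id", "prediction").show()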

Here is a nice Spark docs example of the VectorAssembler class, where multiple columns are combined into the single feature vector that is fed to the learning algorithm.

reference => https://spark.apache.org/docs/latest/ml-features.html#vectorassembler

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])

assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features")

output = assembler.transform(dataset)
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(truncate=False)

Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'
+-----------------------+-------+
|features               |clicked|
+-----------------------+-------+
|[18.0,1.0,0.0,10.0,0.5]|1.0    |
+-----------------------+-------+
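
Putting the two together for the original customer_id question, here is a minimal sketch (the schema and column names are assumptions): leave customer_id out of inputCols and it rides along untouched through fit and transform.

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# hypothetical schema: customer_id plus two numeric feature columns
df = spark.createDataFrame(
    [(101, 25.0, 50000.0, 0.0),
     (102, 40.0, 80000.0, 1.0)],
    ["customer_id", "age", "income", "label"])

# customer_id is deliberately excluded from inputCols, but stays in the dataframe
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
model = Pipeline(stages=[assembler, rf]).fit(df)

# customer_id survives transform, so every prediction maps back to its customer
model.transform(df).select("customer_id", "prediction").show()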