
I have trained a classification model using pyspark.ml.classification.RandomForestClassifier and applied it to a new dataset for prediction. I remove the customer_id column before feeding the dataset to the model, but I am not sure how to map each customer_id back after prediction. Since Spark dataframes are inherently unordered, there is no way for me to identify which row belongs to which customer.
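
A minimal sketch of the situation (dataframe and column names are hypothetical):

# the id is dropped up front, so the prediction output has no key
# left to join back to the customers
scoring_df = new_customers.drop("customer_id")
predictions = rf_model.transform(scoring_df)  # which row is which customer?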

  • why are you removing the id col in the first place? – thePurplePython Sep 16 '19 at 18:20
  • because the id column doesn't add any value to the model (as per my knowledge, and I am new to the DS realm) – Mrinal Sep 16 '19 at 18:24
  • In contrast to most other frameworks, Pyspark ML algos need the features assembled under a single column, referenced as `featuresCol` in the arguments; normally, you should have already done this using `VectorAssembler` (as shown in Step 1 [here](https://stackoverflow.com/questions/47585723/kmeans-clustering-in-pyspark/47593712#47593712)) before fitting your RF model, without including the ID column, which stays as-is (it doesn't affect the model). If not, please *show some details*... – desertnaut Sep 16 '19 at 19:40
  • Looks like I kind of answered my own question unknowingly. Thank you, guys! @thePurplePython, if you could write an answer, I could accept it and close this question. – Mrinal Sep 17 '19 at 02:30

1 Answer


Here is a nice Spark docs example of classification using the pipeline technique, where the original schema is preserved and only the selected columns are used as input features to the learning algorithm (I swapped the original estimator for a random forest).

reference => https://spark.apache.org/docs/latest/ml-pipeline.html

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import HashingTF, Tokenizer

# get (or create) the SparkSession the examples below rely on
spark = SparkSession.builder.getOrCreate()

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and rf.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, rf])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)

# schema is preserved
prediction.printSchema()

root
 |-- id: long (nullable = true)
 |-- text: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

# sample row
for i in prediction.take(1): print(i)

Row(id=4, text='spark i j k', words=['spark', 'i', 'j', 'k'], features=SparseVector(262144, {20197: 1.0, 24417: 1.0, 227520: 1.0, 234657: 1.0}), rawPrediction=DenseVector([5.0857, 4.9143]), probability=DenseVector([0.5086, 0.4914]), prediction=0.0)
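
Because the id column is carried through transform(), the predictions map straight back to their rows; just select the id alongside the prediction:

# no join needed: the id rides along with its row
prediction.select("id", "prediction").show()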

Here is a nice Spark docs example of the VectorAssembler class, where multiple columns are combined into the single feature vector that is fed to the learning algorithm.

reference => https://spark.apache.org/docs/latest/ml-features.html#vectorassembler

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])

assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features")

output = assembler.transform(dataset)
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(truncate=False)

Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'
+-----------------------+-------+
|features               |clicked|
+-----------------------+-------+
|[18.0,1.0,0.0,10.0,0.5]|1.0    |
+-----------------------+-------+
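
Putting the two together for the original customer_id question, here is a minimal sketch (the schema and column names are assumptions): leave customer_id out of inputCols and it rides along untouched through fit and transform.

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# hypothetical schema: customer_id plus two numeric feature columns
df = spark.createDataFrame(
    [(101, 25.0, 50000.0, 0.0),
     (102, 40.0, 80000.0, 1.0)],
    ["customer_id", "age", "income", "label"])

# customer_id is deliberately excluded from inputCols, but stays in the dataframe
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
model = Pipeline(stages=[assembler, rf]).fit(df)

# customer_id survives transform, so every prediction maps back to its customer
model.transform(df).select("customer_id", "prediction").show()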