How to name KMeans clusters in PySpark

Question

I have the following code:

%pyspark
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
(trainingData, testData) = dataFrame.randomSplit([0.7, 0.3])
assembler = VectorAssembler(inputCols = ["PetalLength", "PetalWidth", "SepalLength", "SepalWidth"], outputCol="features")
kmeans = KMeans().setK(3).setSeed(101010)
pipeline = Pipeline(stages=[assembler, kmeans])
modelKMeans = pipeline.fit(dataFrame)

And when I run this:

predictions = modelKMeans.transform(testData)
z.show(predictions)

In the prediction column, I want to see "Iris-setosa" instead of 0, "Iris-versicolor" instead of 1, and "Iris-virginica" instead of 2. Is that possible?

score 0 · Answer 1 · answered Jul 27 '18 at 15:41

KMeans is a clustering algorithm, not a classification algorithm, so it has no notion of which species each of its clusters corresponds to. The cluster ids are arbitrary: if you want "Iris-setosa" instead of 0, you must first check, after fitting, that your "Iris-setosa" group really is cluster 0. You can't know that beforehand. Then you can add a new column with your mapping:

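from pyspark.sql.functions import col, when

# Check which id matches which species first; "group" is just an example column name
groups = when(col("prediction") == 0, "Iris-setosa") \
         .when(col("prediction") == 1, "Iris-versicolor") \
         .when(col("prediction") == 2, "Iris-virginica") \
         .otherwise(None)
predictions = predictions.withColumn("group", groups)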
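
To check which cluster id goes with which species, one option is a majority vote: for each cluster, take the label that occurs most often among its rows. The sketch below is plain Python and only illustrates the idea. It assumes your data still carries the true species in some column (called "Species" here, which is a guess at your schema), and `majority_label_per_cluster` is a hypothetical helper, not part of Spark.

```python
from collections import Counter, defaultdict


def majority_label_per_cluster(pairs):
    """Map each cluster id to the label that occurs most often in it.

    `pairs` is an iterable of (cluster_id, label) tuples.
    """
    counts = defaultdict(Counter)
    for cluster_id, label in pairs:
        counts[cluster_id][label] += 1
    # most_common(1) returns [(label, count)] for the top label
    return {cid: c.most_common(1)[0][0] for cid, c in counts.items()}


# Toy pairs standing in for rows collected from Spark. With a real
# DataFrame they could be built along these lines (assumed column name):
#   rows = predictions.select("prediction", "Species").collect()
#   pairs = [(r["prediction"], r["Species"]) for r in rows]
pairs = [
    (0, "Iris-setosa"), (0, "Iris-setosa"),
    (1, "Iris-versicolor"), (1, "Iris-virginica"), (1, "Iris-versicolor"),
    (2, "Iris-virginica"),
]

mapping = majority_label_per_cluster(pairs)
print(mapping)
```

The resulting dictionary tells you which name to put next to each id in the `when` chain. If two clusters end up with the same majority label, the clustering does not separate the species cleanly and no renaming will fix that.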
