I'd like to learn more about the clusters generated by the K-Means clustering algorithm in Spark MLlib.

By the end of the code snippet below, the value clusters holds a K-Means model.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// sc is assumed to be an existing SparkContext (e.g. the one provided by spark-shell)
val data = List((0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (9.0, 9.0, 9.0))
val dataRDD = sc.parallelize(data)
val parsedData = dataRDD.map(s => Vectors.dense(Array(s._1, s._2, s._3)))
// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)
// clusters.clusterCenters - used to access cluster centers

I can get the cluster id for a test data point using clusters.predict, and the cluster centers using clusters.clusterCenters. But how can I find out which data points belong to each cluster?

For example, I would like information in this form:

Cluster 1 has the following data points:
    (0.0, 0.0, 0.0)
    (0.2, 0.2, 0.2)
Cluster 2 has the following data points:
    (0.1, 0.1, 0.1)
    (9.0, 9.0, 9.0)

One way to do this is to find the cluster id for each training data point using the predict method. But is there a better way, given that the clusters were already built from these data points?
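
For reference, the predict-based approach described above could look roughly like the sketch below. It assumes a standalone script with a local SparkContext (in spark-shell, reuse the existing sc instead), and the name pointsByCluster is only illustrative. As far as I can tell, the model returned by KMeans.train exposes the cluster centers but not the training points, which is why each point is run through predict again here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Assumption: a local SparkContext; in spark-shell, skip this line and use the provided sc.
val sc = new SparkContext(new SparkConf().setAppName("kmeans-members").setMaster("local[*]"))

val data = List((0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (9.0, 9.0, 9.0))
val parsedData = sc.parallelize(data).map(s => Vectors.dense(s._1, s._2, s._3)).cache()

val clusters = KMeans.train(parsedData, 2, 20)

// Pair every training point with its predicted cluster id, then group by that id.
val pointsByCluster = parsedData
  .map(p => (clusters.predict(p), p))
  .groupByKey()
  .collect() // acceptable for a toy data set; avoid collecting a large RDD to the driver
  .toMap

// Print the members of each cluster; cluster ids are 0-based and their order is arbitrary.
pointsByCluster.toSeq.sortBy(_._1).foreach { case (id, points) =>
  println(s"Cluster $id has the following data points:")
  points.foreach(p => println(s"    $p"))
}

sc.stop()
```

For a large data set, it is probably better to keep the (cluster id, point) pairs as an RDD and write them out or aggregate them, rather than calling collect.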

Your help will be greatly appreciated. Thank you.
