I'd like to get more information about the clusters generated by the K-Means clustering algorithm in Spark MLlib.
At the end of the code snippet below, the value clusters holds a K-Means model.
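import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// sc is the SparkContext provided by spark-shell
val data = List((0.0, 0.0, 0.0),(0.1, 0.1, 0.1),(0.2, 0.2, 0.2),(9.0, 9.0, 9.0))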
val dataRDD = sc.parallelize(data)
val parsedData = dataRDD.map(s => Vectors.dense(Array(s._1, s._2, s._3)))
// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)
// clusters.clusterCenters - used to access cluster centers
I can predict the cluster id for a test data point using predict, and I can get the cluster centers using clusters.clusterCenters. But can I find out which data points belong to each cluster?
For example, I would like to get information like this:
Cluster 1 has the following data points:
(0.0, 0.0, 0.0)
(0.2, 0.2, 0.2)
Cluster 2 has the following data points:
(0.1, 0.1, 0.1)
(9.0, 9.0, 9.0)
One way to do this is to find the cluster id for each of the training data points using the predict method. But is there a better way to do this, given that the clusters were already built from these data points?
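For reference, a minimal sketch of the predict-based approach described above. It assumes a local SparkContext created just for illustration (in spark-shell the existing sc would be used instead), and the object name PointsPerCluster and its helper groupPoints are hypothetical names, not part of MLlib. Cluster ids are assigned by the training run, so which points land under which id may differ from the example output above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper object, for illustration only.
object PointsPerCluster {

  // Trains a model on the sample data and returns the training points grouped by cluster id.
  def groupPoints(sc: SparkContext): Map[Int, Seq[Vector]] = {
    val parsedData = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0, 0.0),
      Vectors.dense(0.1, 0.1, 0.1),
      Vectors.dense(0.2, 0.2, 0.2),
      Vectors.dense(9.0, 9.0, 9.0)
    )).cache()

    val clusters = KMeans.train(parsedData, 2, 20)

    // Pair each training point with the id of its nearest center, then group by that id.
    parsedData
      .map(point => (clusters.predict(point), point))
      .groupByKey()
      .collect()
      .map { case (id, points) => (id, points.toSeq) }
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, only so the sketch runs outside spark-shell.
    val conf = new SparkConf().setAppName("points-per-cluster").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      groupPoints(sc).toSeq.sortBy(_._1).foreach { case (id, points) =>
        println(s"Cluster $id has the following data points:")
        points.foreach(println)
      }
    } finally {
      sc.stop()
    }
  }
}
```

Collecting the grouped result to the driver is only reasonable for small data like this; for a large training set the grouped RDD would be kept distributed or written out instead.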
Your help will be greatly appreciated. Thank you.