I have a DataFrame of a million records, and I have used `pyspark.ml.KMeans` to identify the clusters. Now I want to find the Within Set Sum of Squared Errors (WSSSE), i.e. the sum, over all points, of the squared Euclidean distance to the assigned cluster centre, for the number of clusters I have used.

My Spark version is 1.6.0, and `computeCost` is not available in `pyspark.ml` until Spark 2.0.0, so I have to compute it myself.

I have used the following method to find the squared error, but it is taking a long time to give me the output. I am looking for a better way to find the WSSSE.
from pyspark.sql.functions import col

check_error_rdd = clustered_train_df.select(col("C5"), col("prediction"))
c_center = cluster_model.stages[6].clusterCenters()
check_error_rdd = check_error_rdd.rdd
# WSSSE: sum over all points of the squared distance to the assigned centre
print check_error_rdd.map(lambda row: float(((row.C5.toArray() - c_center[row.prediction]) ** 2).sum())).reduce(lambda x, y: x + y)
`clustered_train_df` is my original training data after fitting an ML Pipeline, and `C5` is the `featuresCol` in `KMeans`.

`check_error_rdd` looks like this:
check_error_rdd.take(2)
Out[13]:
[Row(C5=SparseVector(18046, {2398: 1.0, 17923: 1.0, 18041: 1.0, 18045: 0.19}), prediction=0),
Row(C5=SparseVector(18046, {1699: 1.0, 17923: 1.0, 18024: 1.0, 18045: 0.91}), prediction=0)]
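As a sanity check on a single row like those above, I think the per-point term can be computed with `squared_distance` from the vector API (a sketch, assuming `C5` holds `pyspark.mllib.linalg` vectors, which is what `pyspark.ml` used before 2.0):

row = check_error_rdd.first()
# squared Euclidean distance from this point to its assigned centre
print row.C5.squared_distance(c_center[row.prediction])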
`c_center` is the list of cluster centres, where each centre is a list of length 18046:
print len(c_center[1])
18046
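For reference, the variant I am experimenting with broadcasts the centres and lets `squared_distance` work on the sparse vectors directly, instead of densifying every row with `toArray()`. This is only a sketch under the same assumptions as above (`sc` is the active SparkContext), not something I have benchmarked:

from pyspark.sql.functions import col

# ship the centres to each executor once instead of serialising them with every task
bc_centers = sc.broadcast(cluster_model.stages[6].clusterCenters())

wssse = (clustered_train_df
         .select(col("C5"), col("prediction"))
         .rdd
         .map(lambda row: row.C5.squared_distance(bc_centers.value[row.prediction]))
         .sum())
print wssse

Whether `squared_distance` on the sparse vectors is actually faster than the dense subtraction is exactly the part I am unsure about.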