I'm trying to cluster with KMeans in PySpark. I have data like the id_predictions_df example below. I first pivot the data to create a dataframe where the columns are the id_y indices, the rows are the id_x values, and the cell values are adj_prob. There's only one entry per (id_x, id_y) pair, so the '.agg({'adj_prob':'max'})' is only there to make the pivot work. The pivot step is very slow, while the clustering itself runs quickly. Is there a quicker alternative to the pivot step? Pivoting seems unnecessary since I'm turning the result into a vector in the next step anyway. (One possible pivot-free direction is sketched after the code below.)
code:
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Pivot: one row per id_x, one column per id_y, adj_prob as the cell value.
# There is only one entry per (id_x, id_y), so 'max' just satisfies pivot's
# requirement for an aggregation.
pivot_df = id_predictions_df.groupBy('id_x').pivot('id_y').agg({'adj_prob': 'max'})

# Assemble all id_y columns into a single feature vector.
feat_cols = [c for c in pivot_df.columns if c != 'id_x']
vec_assembler = VectorAssembler(inputCols=feat_cols, outputCol='features')
final_data = vec_assembler.transform(pivot_df)

# Fit k-means with k=200 and attach cluster labels.
kmeans = KMeans(featuresCol='features', k=200)
model = kmeans.fit(final_data)
cluster_label_df = model.transform(final_data)
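
One pivot-free direction I'm considering (a sketch, untested against this exact data): since the assembler only needs one vector per id_x anyway, build a SparseVector directly from the long format by collecting the (id_y, adj_prob) pairs per id_x. This assumes the id_y values are non-negative integers small enough to serve directly as sparse-vector indices; missing pairs become implicit zeros, which corresponds to a pivot whose nulls are filled with 0. The names dim, to_sparse, and vector_df are illustrative.

from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

# Vector dimension: assumes id_y itself can be used as an index.
dim = id_predictions_df.agg(F.max('id_y')).first()[0] + 1

def to_sparse(pairs):
    # pairs is a list of (id_y, adj_prob) rows; Vectors.sparse requires sorted indices.
    pairs = sorted(pairs)
    return Vectors.sparse(dim, [p[0] for p in pairs], [p[1] for p in pairs])

to_sparse_udf = F.udf(to_sparse, VectorUDT())

# One row per id_x with a sparse 'features' vector, no pivot needed.
vector_df = (id_predictions_df
             .groupBy('id_x')
             .agg(F.collect_list(F.struct('id_y', 'adj_prob')).alias('pairs'))
             .select('id_x', to_sparse_udf('pairs').alias('features')))

KMeans can then be fit on vector_df directly, e.g. model = KMeans(featuresCol='features', k=200).fit(vector_df), and the sparse representation avoids materializing the very wide, mostly-empty pivoted dataframe.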
data:
id_predictions_df.show(truncate=False)
+-----+-------+--------+
|id_x |id_y |adj_prob|
+-----+-------+--------+
|388 |185750 |0.0 |
|8465 |15826 |0.0 |
|8712 |520418 |0.0 |
|11139|400617 |0.0 |
|12999|42364 |0.0 |
|13382|14100 |0.0 |
|15479|1075409|0.0 |
|15582|721538 |0.0 |
|16162|103031 |0.0 |
|17418|1129613|0.0 |
|18183|490223 |0.0 |
|20730|208942 |0.0 |
|23773|625286 |0.0 |
|26148|258915 |0.0 |
|29685|995242 |0.0 |
|29786|753786 |0.0 |
|30336|411385 |0.0 |
|32624|1290430|0.0 |
|33217|1194822|0.0 |
|34730|1006203|0.0 |
+-----+-------+--------+