
I am trying to cluster with k-means in PySpark. I have data like the `id_predictions_df` example below. I first pivot the data to create a dataframe whose columns are the `id_y` indices, whose rows are the `id_x` values, and whose cell values are `adj_prob`. There is only one entry per `(id_x, id_y)` pair, so the `.agg({'adj_prob': 'max'})` is just there to make the pivot work. The pivot step is very slow; the clustering itself runs quickly. Is there a faster alternative to the pivot? Pivoting seems unnecessary, since I'm turning the data into a vector in the next step anyway.

code:

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# pivot: one row per id_x, one column per id_y, cell values = adj_prob
pivot_df = id_predictions_df.groupBy('id_x').pivot('id_y').agg({'adj_prob': 'max'})

# every pivoted column except the id is a feature
feat_cols = [x for x in pivot_df.columns if x != 'id_x']

# assemble the pivoted columns into a single vector column
vec_assembler = VectorAssembler(inputCols=feat_cols, outputCol='features')
final_data = vec_assembler.transform(pivot_df)

# fit k-means with k=200 and attach a cluster label to each row
kmeans3 = KMeans(featuresCol='features', k=200)
model_k3 = kmeans3.fit(final_data)
cluster_label_df = model_k3.transform(final_data)
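
For reference, a pivot-free alternative sketch (not the original approach, and untested against this exact data): because each `(id_x, id_y)` pair appears only once, you can group by `id_x`, collect the `(id_y, adj_prob)` pairs, and build sparse ML vectors directly with a UDF. This assumes the number of distinct `id_y` values is small enough to collect to the driver for index assignment; `id_y_index`, `n_features`, and `to_sparse` are names introduced here for illustration.

from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

# assign each distinct id_y a contiguous vector position
# (assumption: the distinct id_y values fit on the driver)
id_y_index = {r['id_y']: i for i, r in
              enumerate(id_predictions_df.select('id_y').distinct().collect())}
n_features = len(id_y_index)

@F.udf(returnType=VectorUDT())
def to_sparse(id_ys, probs):
    # one (position, value) entry per observed id_y; unobserved id_y stay 0.0
    return Vectors.sparse(n_features, {id_y_index[y]: float(p)
                                       for y, p in zip(id_ys, probs)})

# group the long-format rows by id_x and build the vectors directly,
# skipping both the pivot and the VectorAssembler
final_data = (id_predictions_df
              .groupBy('id_x')
              .agg(F.collect_list('id_y').alias('id_ys'),
                   F.collect_list('adj_prob').alias('probs'))
              .withColumn('features', to_sparse('id_ys', 'probs')))

The resulting `features` column can be fed to `KMeans` as-is, so the `VectorAssembler` step disappears along with the pivot.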

data:

id_predictions_df.show(truncate=False)


+-----+-------+--------+
|id_x |id_y   |adj_prob|
+-----+-------+--------+
|388  |185750 |0.0     |
|8465 |15826  |0.0     |
|8712 |520418 |0.0     |
|11139|400617 |0.0     |
|12999|42364  |0.0     |
|13382|14100  |0.0     |
|15479|1075409|0.0     |
|15582|721538 |0.0     |
|16162|103031 |0.0     |
|17418|1129613|0.0     |
|18183|490223 |0.0     |
|20730|208942 |0.0     |
|23773|625286 |0.0     |
|26148|258915 |0.0     |
|29685|995242 |0.0     |
|29786|753786 |0.0     |
|30336|411385 |0.0     |
|32624|1290430|0.0     |
|33217|1194822|0.0     |
|34730|1006203|0.0     |
+-----+-------+--------+
  • It is not clear to me: since you said there is only one entry per row, I suppose the `id_y` are unique too. Why would you pivot the dataframe in the first place? – Ric S Jul 28 '21 at 07:36
  • @RicS Thank you for getting back to me. I would pivot it so that I could feed the dataframe to the VectorAssembler, and then feed the resulting dataframe to k-means. Are you saying I can run k-means in PySpark ML without putting the data through the VectorAssembler? – user3476463 Jul 28 '21 at 14:57
  • If your `id_y` variable contains all distinct values, I don't understand why you would pivot it. For the other point, I'm afraid you need a VectorAssembler to use K-means – Ric S Jul 29 '21 at 07:12
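
Regarding the last comment: strictly speaking, `KMeans` only requires a vector-typed column named by `featuresCol`; `VectorAssembler` is the usual way to build one from scalar columns, but any vector column works. A minimal usage sketch, assuming the `final_data` built by the UDF sketch above:

from pyspark.ml.clustering import KMeans

# fit directly on the UDF-built 'features' column; no pivot, no VectorAssembler
kmeans = KMeans(featuresCol='features', k=200)
model = kmeans.fit(final_data)
cluster_label_df = model.transform(final_data)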

0 Answers