I am trying to apply GMM clustering (as in https://spark.apache.org/docs/latest/ml-clustering.html) to a given DataFrame, as follows:
>>> vector.show(1)
ID | Features
33.0 | [0.0,1.0,27043.0,....]

>>> type(vector)
pyspark.sql.dataframe.DataFrame

>>> type(vector.select('features'))
pyspark.sql.dataframe.DataFrame

>>> vector.printSchema()
root
 |-- id: double (nullable = true)
 |-- features: vector (nullable = true)
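For reference, a minimal equivalent setup looks like this (the values are placeholders rather than my real data, and I am assuming the features column is a standard pyspark.ml.linalg dense vector, which matches the schema above):

# Minimal equivalent DataFrame (placeholder values, not my real data)
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
vector = spark.createDataFrame(
    [(33.0, Vectors.dense([0.0, 1.0, 27043.0])),
     (34.0, Vectors.dense([1.0, 0.0, 118.0]))],
    ["id", "features"],
)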
Then I ran the following code to create the clusters:
from pyspark.ml.clustering import GaussianMixture

# Fit a 5-component Gaussian mixture on the "features" column
gmm = GaussianMixture().setK(5).setSeed(538009335).setFeaturesCol("features")
gmm_model = gmm.fit(vector)

# Per-component mean and covariance matrix
gmm_model.gaussiansDF.show()

# Cluster assignments and membership probabilities
gmm_predictions = gmm_model.transform(vector)
gmm_predictions.show()
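To make the symptom concrete, this is roughly how I inspected the fitted model (the values in the comments are what I actually observe):

print(gmm_model.weights)                    # -> [0.2, 0.2, 0.2, 0.2, 0.2]
gmm_model.gaussiansDF.show(truncate=False)  # identical mean/cov on every row
gmm_predictions.select("prediction", "probability").show(5, truncate=False)
# -> prediction is always 0, probability always [0.2,0.2,0.2,0.2,0.2]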
This runs without any errors, but as the output above shows, the algorithm returns the same mean and covariance for every cluster and assigns every row/ID to cluster 0, the membership probability always being 0.2 for every cluster ([0.2, 0.2, 0.2, 0.2, 0.2]).
Do you know why it gives me such results?
NB: The data are not responsible for this "bad" clustering: having tried K-means with both scikit-learn and PySpark, I get a "realistic" clustering with scikit-learn.
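The scikit-learn comparison was along these lines (a simplified sketch; reusing the same seed and collecting the features to the driver are assumptions I make here for illustration):

# Sketch of the scikit-learn K-means comparison on the same features
import numpy as np
from sklearn.cluster import KMeans

X = np.array([row["features"].toArray()
              for row in vector.select("features").collect()])
kmeans = KMeans(n_clusters=5, random_state=538009335)
labels = kmeans.fit_predict(X)  # these clusters look realistic, unlike the GMM output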
Thank you in advance for your help.
Best regards