
I am trying to apply a GMM clustering algorithm (as in https://spark.apache.org/docs/latest/ml-clustering.html) on a given DataFrame, as follows:

vector.show(1)

  ID | Features
33.0 | [0.0,1.0,27043.0,....]

type(vector)

pyspark.sql.dataframe.DataFrame

type(vector.select('features'))

pyspark.sql.dataframe.DataFrame

vector.printSchema()

root
 |-- id: double (nullable = true)
 |-- features: vector (nullable = true)
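
For context, a DataFrame with this schema is typically produced with a VectorAssembler; below is a minimal sketch of how such a vector column might be built (raw_df and the input column names are placeholders, not taken from the question):

from pyspark.ml.feature import VectorAssembler

# Hypothetical raw DataFrame with one numeric column per feature
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
vector = assembler.transform(raw_df).select("id", "features")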

Then I tried the following code to create the clusters:

from pyspark.ml.clustering import GaussianMixture

# Fit a 5-component Gaussian mixture on the "features" column
gmm = GaussianMixture().setK(5).setSeed(538009335).setFeaturesCol("features")
gmm_model = gmm.fit(vector)

# Inspect the fitted Gaussians and assign each row to a cluster
gmm_model.gaussiansDF.show()
gmm_predictions = gmm_model.transform(vector)
gmm_predictions.show()

This runs without any errors, but the algorithm ends up returning the same mean and covariance for all clusters and assigns every row/ID to cluster 0, the probabilities always being 0.2 for every cluster ([0.2, 0.2, 0.2, 0.2, 0.2]).

Do you know why it returns such results?

NB: the data are not responsible for this "bad" clustering; having tried K-means with both Scikit-learn and PySpark, I get a "realistic" clustering with Scikit-learn.

Thank you in advance for your help.

Best regards

Olscream
  • Try normalizing your data before clustering. I wouldn't be surprised if Spark has some numerical issues... – Has QUIT--Anony-Mousse May 31 '19 at 20:18
  • First of all, thank you very much for your help Anony-Mousse! About your idea, I tried to L1-normalize the data (as described here: https://spark.apache.org/docs/2.2.0/ml-features.html#normalizer) and got back, during the training phase: "breeze.linalg.NotConvergedException" (like here: https://stackoverflow.com/questions/47340602/pyspark-pca-avoiding-notconvergedexception?rq=1) – Olscream Jun 03 '19 at 15:01
  • If I try to process the features through a MinMaxScaler, the clustering never finishes, even after 2 hours, whereas it takes only 5 minutes without any scaling (a sketch of this attempt is shown after these comments). – Olscream Jun 03 '19 at 15:03
  • Have you already encountered this problem? – Olscream Jun 03 '19 at 15:03
  • I encountered this problem today myself. K means in mllib gives me perfect clusters with silhouette scores. Tried both with standardized and non-std data, still get just one cluster with GMM. Have you solved the issue yet? – ricardo Sep 10 '21 at 19:37
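
For reference, the MinMaxScaler attempt described in the comments above would look roughly like this; a minimal sketch, assuming the same vector DataFrame as in the question (the scaled column name is arbitrary):

from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.clustering import GaussianMixture

# Rescale each feature to [0, 1] before clustering
scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")
scaled = scaler.fit(vector).transform(vector)

# Fit the mixture on the scaled column instead of the raw features
gmm = GaussianMixture(k=5, seed=538009335, featuresCol="scaled_features")
gmm_model = gmm.fit(scaled)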

1 Answer


I think the main reason behind the bad clustering is that PySpark's GMM uses a diagonal covariance matrix only, as opposed to a full covariance matrix. A diagonal covariance matrix does not account for the covariance between the features in the dataset and may therefore result in bad clustering.

You can check PySpark's implementation of GMM at: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala

Whereas when you use Sklearn's implementation of GMM, it uses a full covariance matrix by default, which models the covariance between every pair of features, as opposed to a diagonal covariance matrix.
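
To illustrate the difference, here is a small scikit-learn sketch comparing the two covariance types on synthetic correlated data (the data is only a stand-in, not the asker's dataset):

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic, strongly correlated 2-D data as a stand-in for the real features
rng = np.random.RandomState(0)
base = rng.randn(500, 2) @ np.array([[2.0, 1.5], [0.0, 0.5]])
X = np.vstack([base, base + [6.0, 6.0]])  # two correlated blobs

# Full covariance (scikit-learn's default): models correlation between features
gmm_full = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

# Diagonal covariance (what this answer attributes to Spark): per-feature variances only
gmm_diag = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(X)

print(gmm_full.covariances_.shape)  # (2, 2, 2): one full matrix per component
print(gmm_diag.covariances_.shape)  # (2, 2): only the diagonal per component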