Gaussian Mixture Model (GMM) giving only one cluster

Question

I have a dataset with 70 columns and 4.4 million rows that I want to cluster. I applied TF-IDF first, then clustered with K-means, bisecting K-means and a Gaussian Mixture Model (GMM). While the other techniques give me the specified number of clusters, GMM gives only one. For example, in the code below I ask for 20 clusters but get only 1. Is this happening because I have many columns, or is it caused by the nature of the data?

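from pyspark.ml.clustering import GaussianMixture

# rescaledData: the DataFrame produced by the TF-IDF step (not shown)
gmm = GaussianMixture(k=20, tol=0.000001, maxIter=10000, seed=1)
model = gmm.fit(rescaledData)
df1 = model.transform(rescaledData).select(['label', 'prediction'])
df1.groupBy('prediction').count().show()  # this returns 1 row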

score 1 · Answer 1 · answered Mar 10 '21 at 09:08

In my opinion, the main reason for the poor clustering performance of PySpark's GMM is the dimensionality of the data, not a missing covariance term. The implementation fits a full covariance matrix for each component, so covariance between features is taken into account, but that matrix grows quadratically with the number of features.

Check its implementation here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala

where the class documentation warns about exactly this, which is in effect the curse of dimensionality:

@note This algorithm is limited in its number of features since it requires storing a covariance matrix which has size quadratic in the number of features. Even when the number of features does not exceed this limit, this algorithm may perform poorly on high-dimensional data. This is due to high-dimensional data (a) making it difficult to cluster at all (based on statistical/theoretical arguments) and (b) numerical issues with Gaussian distributions.
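
To put rough numbers on the two points in that note, here is a minimal sketch in plain Python (no Spark needed). The helper names `gmm_free_parameters` and `log_std_normal_pdf` are made up for illustration, and the "5 standard deviations per coordinate" point is an assumed example, not something measured on the asker's data.

```python
import math

def gmm_free_parameters(k: int, d: int) -> int:
    """Free parameters of a k-component GMM with full covariances in d dimensions."""
    mean_params = d
    cov_params = d * (d + 1) // 2   # symmetric matrix: only the upper triangle is free
    weight_params = k - 1           # mixing weights must sum to 1
    return k * (mean_params + cov_params) + weight_params

def log_std_normal_pdf(d: int, offset: float) -> float:
    """Log-density of a d-dimensional standard normal at a point whose every
    coordinate lies `offset` standard deviations away from the mean."""
    return -0.5 * d * math.log(2 * math.pi) - 0.5 * d * offset ** 2

# (a) Model size for the setup in the question: k=20 components, 70 features.
print(gmm_free_parameters(k=20, d=70))   # 51119

# (b) Numerical trouble: the density itself underflows to zero,
#     even though its logarithm is perfectly representable.
log_p = log_std_normal_pdf(d=70, offset=5.0)
print(log_p)             # roughly -939.3
print(math.exp(log_p))   # 0.0
```

So with 70 features each component carries 2,485 covariance entries, and a point that is only moderately far from a component in every coordinate already has a density that rounds to zero in double precision. When that happens for most components, the responsibilities computed during EM carry little information, which is one plausible way for a fit to end up assigning everything to a single component.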

Would you suggest PCA before GMM in that case to tackle the curse of dimensionality? — ricardo, Sep 10 '21 at 20:25
