
I used GaussianMixture (GMM) from the scikit-learn package for clustering. The Python code is below.

import pandas as pd
from numpy import unique
from numpy import where
from sklearn.mixture import GaussianMixture
from matplotlib import pyplot

# load the data from the first sheet of the workbook
rawData = pd.read_excel('ClusteringFailure.xlsx', sheet_name=0)
X = rawData.to_numpy(dtype='float64')

# define the model and set the number of clusters to 4 for genotyping
model = GaussianMixture(n_components=4)

# fit the model
model.fit(X)

# assign a cluster index to each data point
yCluster = model.predict(X)
clusters = unique(yCluster)

# plot each cluster in its own color
for cluster in clusters:
    row_ix = where(yCluster == cluster)[0]  # row indices belonging to this cluster
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
Here is the data I used. 
x   y
18.586  46.33
0.109   68.534
0.074   5.242
22.212  63.888
3.726   36.767
0.159   6.98
24.531  9.925
0.143   0.299
29.91   54.539
29.868  12.522
0.064   2.6
29.978  48.665

I ran it multiple times and every time the clustering was different. Can anyone explain why it is not consistent and advise on how to improve the consistency? Thanks!

  • The mixture is fitted using the EM algorithm, which is non-deterministic. Your data might also not be well suited to a Gaussian mixture. Could you provide the list in a cleaner format, or plot it as a scatter plot, so we can understand why the GM does not fit well? – PlainRavioli Sep 19 '22 at 17:58
  • I have plotted your data, and it does look very hard (even more so with so few examples) to fit a 4-component Gaussian mixture to it. Visually, on this 2D set, we can't identify any Gaussian clusters. – PlainRavioli Sep 19 '22 at 18:10
  • Thanks PlainRavioli for the advice! When I used cleaner scatter data, it gave consistent clusters. My original question was why the clustering was not repeatable on the same set of data. Why is the EM algorithm non-deterministic? Is it because it starts from different initial conditions, so it may fit differently? Thanks! – user3673063 Sep 21 '22 at 20:54
  • Yes, the initialization plays a role: the EM algorithm converges towards a local optimum (a local maximum of the likelihood), which does not ensure the same result for every initialization. This is well explained here: https://stats.stackexchange.com/questions/153254/convergence-of-em-for-mixture-of-gaussians – PlainRavioli Sep 22 '22 at 08:08
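
Following on from the comments about initialization, here is a minimal sketch of one way to make the runs repeatable. It assumes that fixing the seed through `random_state` and trying several starts through `n_init` is acceptable for this use case. The seed value 0, the choice of 10 starts, and the helper name `fit_labels` are arbitrary choices for illustration, not something from the question. The points are the 12 rows posted above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# The 12 (x, y) points listed in the question.
X = np.array([
    [18.586, 46.33], [0.109, 68.534], [0.074, 5.242], [22.212, 63.888],
    [3.726, 36.767], [0.159, 6.98], [24.531, 9.925], [0.143, 0.299],
    [29.91, 54.539], [29.868, 12.522], [0.064, 2.6], [29.978, 48.665],
])


def fit_labels(data, seed):
    """Hypothetical helper: fit a 4-component mixture with a fixed seed.

    Returns the cluster label of each point and the final lower bound on
    the log-likelihood reached by the best of the n_init runs.
    """
    model = GaussianMixture(
        n_components=4,
        n_init=10,          # assumption: run EM from 10 starts, keep the best
        random_state=seed,  # fixes the otherwise random initialization
    )
    labels = model.fit_predict(data)
    return labels, model.lower_bound_


# Two separate fits with the same seed start from the same initialization.
labels_a, bound_a = fit_labels(X, seed=0)
labels_b, bound_b = fit_labels(X, seed=0)
print(np.array_equal(labels_a, labels_b))
print(bound_a, bound_b)
```

Fixing the seed only makes the result repeatable; it does not make the fit any better. With so few points, different seeds may still settle on different local optima, so comparing `lower_bound_` across seeds is one way to see how unstable the solution is.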
