I am clustering some data and would like to use the 'soft boundaries' of a Gaussian Mixture Model (GMM) to assign each data point to multiple clusters at the same time, with different degrees of belief.
For example, a data point could have a 60% probability of belonging to cluster 1 and a 40% probability of belonging to cluster 2.
I studied my dataset and found that 3 clusters would be a good choice for this data.
I used GaussianMixture from sklearn for this work. When I fit the model and then call predict_proba()
on my data, every sample is assigned a probability very close or equal to 100% for a single cluster.
For example, if I call predict_proba()
on a sample X, I can get
this result: [0, 0, 1], meaning that sample X belongs to cluster 3 with a probability of 1,
or this result: [2.89e-300, 1, 0], meaning that sample X belongs to cluster 2 with a probability of 1 and to cluster 1 with a probability of 2.89e-300 ~ 0.
Here is the code and a screenshot of my results:
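from sklearn.mixture import GaussianMixture as GMM  # assumed import: the alias GMM is not defined in the snippet
# Fit a 3-component GMM with diagonal covariances on the feature matrix
gmm = GMM(n_components=3, covariance_type='diag', max_iter=100, verbose=1).fit(bert_features[1])
# Posterior probability of each component for every sample, shape (n_samples, 3)
proba_predicted = gmm.predict_proba(bert_features[1])
print(proba_predicted)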
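One possible contributing factor, which is an assumption and not something confirmed for this dataset, is the dimensionality of the features: the log-density of a diagonal Gaussian is a sum over all dimensions, so small per-dimension differences between components can add up to a very large log-ratio. The sketch below tries to reproduce the behaviour on synthetic data. The cluster offsets, sample counts and the helper name `mean_max_responsibility` are made up for illustration, and the synthetic features are only a stand-in for `bert_features[1]`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def mean_max_responsibility(n_features, n_per_cluster=200, seed=0):
    """Fit a 3-component diagonal GMM on synthetic data and return the
    average of the largest predict_proba() value per sample."""
    rng = np.random.default_rng(seed)
    # Three unit-variance clusters whose means differ by only 0.5 in each
    # dimension (hypothetical values, chosen so the clusters overlap in 2-D).
    offsets = (0.0, 0.5, -0.5)
    X = np.concatenate(
        [off + rng.normal(size=(n_per_cluster, n_features)) for off in offsets]
    )
    gmm = GaussianMixture(
        n_components=3, covariance_type="diag", max_iter=100, random_state=seed
    ).fit(X)
    return gmm.predict_proba(X).max(axis=1).mean()


low_dim = mean_max_responsibility(n_features=2)
high_dim = mean_max_responsibility(n_features=200)
print(f"mean max responsibility, 2 features:   {low_dim:.4f}")
print(f"mean max responsibility, 200 features: {high_dim:.4f}")
```

The per-dimension overlap is the same in both runs. Only the number of features changes, so comparing the two printed values gives a rough idea of how much of the saturation could come from dimensionality alone.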
I would like my model to give results with softer boundaries. Is there a way to tune the model? Maybe by changing the covariance matrix so that samples can belong to several clusters?
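To make the kind of tuning I have in mind concrete, here is a minimal sketch of one heuristic: dividing the per-component weighted log-densities by a temperature before normalising them. This is an assumption on my side, not a built-in option of GaussianMixture. The function `tempered_responsibilities` and the `temperature` parameter are hypothetical names, and the formula only applies to covariance_type='diag'. With a temperature of 1 it should reproduce predict_proba(); larger values flatten the assignments, but the result is then no longer the model's actual posterior.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.mixture import GaussianMixture


def tempered_responsibilities(gmm, X, temperature=1.0):
    """Softmax over components of (log weight + log density) / temperature.

    Only valid for a GaussianMixture fitted with covariance_type='diag'.
    """
    X = np.asarray(X, dtype=float)
    means = gmm.means_            # shape (n_components, n_features)
    variances = gmm.covariances_  # same shape for 'diag'
    diff = X[:, None, :] - means[None, :, :]
    # log N(x | mu_k, diag(var_k)) for every sample and component
    log_pdf = -0.5 * (
        np.sum(diff ** 2 / variances[None, :, :], axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
    )
    weighted = (log_pdf + np.log(gmm.weights_)[None, :]) / temperature
    # Normalise in log space to avoid underflow
    return np.exp(weighted - logsumexp(weighted, axis=1, keepdims=True))


# Synthetic stand-in for the real features (hypothetical data)
rng = np.random.default_rng(0)
n_features = 50
centers = np.stack(
    [np.zeros(n_features), np.full(n_features, 0.5), np.full(n_features, -0.5)]
)
X = np.concatenate([c + rng.normal(size=(100, n_features)) for c in centers])

gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0).fit(X)
hard = gmm.predict_proba(X)
same = tempered_responsibilities(gmm, X, temperature=1.0)
soft = tempered_responsibilities(gmm, X, temperature=20.0)
print("mean max probability, predict_proba: ", hard.max(axis=1).mean())
print("mean max probability, temperature=20:", soft.max(axis=1).mean())
```

The temperature value of 20 is arbitrary. I do not know whether this is a principled way to get soft boundaries or whether something like reducing the dimensionality before fitting would be more appropriate.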
Thank you in advance for your help