
I have 1D data (a single column). I used a Gaussian Mixture Model (GMM) for density estimation, using this implementation in Python: https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html. By relying on the AIC/BIC criterion I was able to determine the number of components. After I fit the GMM, I plotted the kernel density estimate of the original observations plus that of data sampled from the GMM. The plots of the original and sampled densities are quite similar (that is good). But I would like some metric to report how good the fitted model is.

g = GaussianMixture(n_components=35)

data = df['x'].values.reshape(-1, 1)  # data taken from a data frame (10,000 data points)
clf = g.fit(data)  # fit model

samples = clf.sample(10000)[0]  # generate sample data points (same number as the original data points)

I found score in the implementation, but I am not sure how to use it. Am I doing it wrong? Or is there a better way to show how accurate the fitted model is, apart from histogram or kernel density plots?

print(clf.score(data))
print(clf.score(samples))
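For context, score returns the average log-likelihood per sample under the fitted mixture (higher is better). One common way to turn this into a goodness-of-fit check is to evaluate it on a held-out split, so the number is not inflated by overfitting. A minimal sketch, using synthetic stand-in data since the original df is not shown:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=(10_000, 1))  # stand-in for df['x'].values.reshape(-1, 1)

train, test = train_test_split(data, test_size=0.2, random_state=0)
gmm = GaussianMixture(n_components=2, random_state=0).fit(train)

# score() is the mean log-likelihood per sample; comparing train vs. held-out
# scores shows whether the density fit generalizes.
print(gmm.score(train))
print(gmm.score(test))
```

If the held-out score is close to the training score, the fitted density generalizes; a large gap suggests too many components.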
MWH

2 Answers


You can use normalized_mutual_info_score, adjusted_rand_score or silhouette_score to evaluate your clusters. All of these metrics are implemented in sklearn.metrics.

EDIT: You can check this link for a more detailed explanation.

In summary:

  • Adjusted Rand Index: measures the similarity of the two assignments.

  • Normalized Mutual Information: measures the agreement of the two assignments.

  • Silhouette Coefficient: measures how well-assigned each individual point is.
A small example:

from sklearn.metrics import silhouette_score

gmm.fit(x_vec)
pred = gmm.predict(x_vec)

print("gmm silhouette:", silhouette_score(x_vec, pred))
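A self-contained sketch of the same idea, using synthetic 1D data as a stand-in. Note that silhouette_score expects a 2D array, so 1D column data must be reshaped, which is the source of the "Expected 2D array" error mentioned in the comments:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in: two well-separated 1D clusters
x = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 500)])
x_vec = x.reshape(-1, 1)  # reshape 1D column data to 2D for sklearn

gmm = GaussianMixture(n_components=2, random_state=0).fit(x_vec)
labels = gmm.predict(x_vec)

# Silhouette ranges from -1 (poor) to +1 (dense, well-separated clusters)
print(silhouette_score(x_vec, labels))
```

Note this evaluates the quality of the cluster assignments, not the fitted density itself, which is the distinction raised in the comments below.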
Batuhan B
  • 1,835
  • 4
  • 29
  • 39
  • Could you provide an example of how to apply it to my case and how to interpret the results? The sklearn documentation is not always helpful for me. Thanks – MWH Jan 20 '20 at 16:29
  • I think these metrics are for classification tasks. What I am doing is density estimation using a Gaussian mixture. But thanks anyway! – MWH Jan 20 '20 at 18:58
  • You can use the silhouette metric for unsupervised tasks. It calculates how similar an object is to its own cluster compared to other clusters. @MWH – Batuhan B Jan 21 '20 at 07:51
  • Thanks. I am getting the error "ValueError: Expected 2D array, got 1D array instead". Is this supposed to work with data that has one column (X) only? I passed x (original data) and y (sampled data); both have the same number of data points (rows). – MWH Jan 21 '20 at 11:36
  • @MWH I added a small example to my answer showing how you can use the silhouette score for your clusters. – Batuhan B Jan 21 '20 at 12:51
  • Thanks, it works now, returning a value between -1 and 1. But from what I understand this score is used to determine the number of clusters, e.g., components in the GMM. I already decided that using the AIC/BIC criteria. I still do not get whether this score is a goodness-of-fit measure showing how well the fitted GMM matches the distribution, or just a way to parameterize the model. – MWH Jan 21 '20 at 13:41
  • To determine the number of clusters you can use the elbow method. After clustering you can check the silhouette score and try to understand your cluster structure. – Batuhan B Jan 21 '20 at 16:51
  • @MWH If the solution works for you, could you please vote for it? – Batuhan B Jan 22 '20 at 08:55
  • I did, but I am still not sure if this is the right answer for my case. I am using the GMM for density estimation (not clustering) to fit a distribution to data, following this tutorial: https://nbviewer.jupyter.org/github/jakevdp/ESAC-stats-2014/blob/master/notebooks/05.3-Density-GMM.ipynb. If you explain why this solution would work for me as an evaluation metric, or add more details, I might accept the answer. Thanks – MWH Jan 22 '20 at 11:52

I would rather use cross-validation and check the accuracy of the trained model.

Use the predict method of the fitted model to predict the labels of unseen data (use cross-validation and report the accuracy): https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture.predict

Toy example:

g = GaussianMixture(n_components=35)
g.fit(train_data)  # fit model
y_pred = g.predict(test_data)

EDIT:

There are several options to measure the performance of your unsupervised case. For a GMM, which is based on an explicit probability model, the most common are BIC and AIC. They are included directly in the scikit-learn GMM class as the aic and bic methods.
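A minimal sketch of comparing component counts with AIC/BIC (lower is better), using synthetic stand-in data since the real column is not shown:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = rng.normal(size=(5_000, 1))  # hypothetical stand-in for the real 1D column

scores = {}
for k in (1, 2, 4):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(data)
    scores[k] = (gmm.aic(data), gmm.bic(data))  # lower is better for both
    print(k, scores[k])
```

Since the stand-in data comes from a single Gaussian, BIC should favor the smallest model here; on real data, pick the component count where the criterion bottoms out.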

seralouk
  • To the best of my knowledge, cross-validation works only if we have 2D data, e.g., X and y: we pass X, predict y and compare it with the original y (then calculate mean absolute error, etc.). In my case I only have X, so when I sample I possibly get completely different numbers. Correct me if I am wrong! – MWH Jan 20 '20 at 16:16
  • I see your point. In case you use GMM as an unsupervised model, then my answer cannot be used. In other words, if you do not have labels (`y`), you cannot use this proposed cross-validation scheme. See my edit. – seralouk Jan 20 '20 at 16:19
  • Thanks. I already used AIC/BIC to guide me on the number of components. But I am looking to evaluate the accuracy; I only found score, which I do not understand well. – MWH Jan 20 '20 at 16:26