I understand that a GMM is not itself a classifier, but I am trying to follow the advice of some users in the Stack Exchange post quoted below to create a GMM-inspired classifier.
lejlot: Multiclass classification using Gaussian Mixture Models with scikit learn
"construct your own classifier where you fit one GMM per label and then use assigned probability to do actual classification. Then it is a proper classifier"
What is meant by "assigned probability" for Matlab GMM objects in the quote above, and how can we input a new point to get this assigned probability? For a new point that we are trying to classify, my understanding is that we need the posterior probabilities that the point belongs to each of the two Gaussians, and then compare those two probabilities.
From the documentation (https://www.mathworks.com/help/stats/gmdistribution.html), it looks like we only have access to the cluster centers (mu) and covariance matrices (Sigma), but not to an actual probability density function that would take in a point and return a probability.
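In principle, since each of my fits will have only one component, I suppose I could evaluate the density by hand from those two properties with mvnpdf (this is my own workaround idea, not something the GMM docs suggest):

% My own workaround idea: hand-rolled density for a one-component fit gm,
% where x_new is a 1-by-d row vector of features
density = mvnpdf(x_new, gm.mu, gm.Sigma);

But I assume the gmdistribution object is supposed to hand me something like this directly.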
podludek: Multiclass classification using Gaussian Mixture Models with scikit learn
- "GMM is not a classifier, but generative model. You can use it to a classification problem by applying Bayes theorem.....You should use GMM as a posterior distribution, one GMM per each class." -
In the Matlab documentation for posterior(gm,X), the tutorial shows us inputting X, which is the same data we used to create ("train") our GMM. But how can we get the posterior probability of being in a cluster for a new point?
https://www.mathworks.com/help/stats/gmdistribution.posterior.html
"P = posterior(gm,X)
returns the posterior probability of each Gaussian mixture component in gm given each observation in X"
--> But the X used in the link above is the 'training' data used to create the GMM itself, not a new point. Also, we have two gm objects, not one. How can we grab the probability that a point belongs to a given Gaussian?
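From the function signature alone, it seems like nothing stops me from passing a brand-new observation to the fitted objects I build below, e.g. (my own guess at the intended usage, not something the docs show):

% My guess at how posterior would be used on an unseen point:
% x_new is a new observation with the same number of features as X
x_new = randn(1, size(X,2));
P1 = posterior(gmfit_class_1, x_new)  % 1-by-k row of component posteriors
P2 = posterior(gmfit_class_2, x_new)

But I don't know whether that is the intended use, or how I would then turn P1 and P2 into a single class decision.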
The pseudocode below is how I envisioned a GMM-inspired classifier would work for a two-class example: I would fit one GMM per class, as described by podludek. Then I would compute the posterior probabilities of a point being in each cluster and pick the bigger probability (see the sketch after the test calls below).
I'm aware there are issues with this conceptually (such as the two GMM objects having conflicting covariance matrices) but I've been assured by my mentor that there is a way to make a supervised version of GMM, and he wants me to make one, so here we go:
Pseudocode:
X % The training data matrix
% each new row is a new data point
% each column is a new feature
% Ex: if you had 10,000 data points and 100 features for each, your matrix
% would be 10000 by 100
% Let's say we had 200 points of each class in our training data
% Grab the subsets of X that correspond to classes 1 and 2
X_only_class_2 = X(1:200,:);
X_only_class_1 = X(201:end,:);
gmfit_class_1 = fitgmdist(X_only_class_1, 1, 'RegularizationValue', 0.1);
cov_matrix_1 = gmfit_class_1.Sigma;
gmfit_class_2 = fitgmdist(X_only_class_2, 1, 'RegularizationValue', 0.1);
cov_matrix_2 = gmfit_class_2.Sigma;
% Now do some tests on data we already know the classification of to check if this is working as we would expect:
a = posterior(gmfit_class_1,X_only_class_1)
b = posterior(gmfit_class_1,X_only_class_2)
c = posterior(gmfit_class_2,X_only_class_1)
d = posterior(gmfit_class_2,X_only_class_2)
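For completeness, the 'pick the bigger probability' step I mentioned above would then look roughly like this (a sketch of my intent, assuming posterior gives the kind of probability I want, which is exactly what I'm unsure about):

% Sketch of my intended decision rule for a new 1-by-100 point x_new
p1 = posterior(gmfit_class_1, x_new);  % "probability" under the class-1 model
p2 = posterior(gmfit_class_2, x_new);  % "probability" under the class-2 model
if p1 > p2
    predicted_label = 1;
else
    predicted_label = 2;
end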
But unfortunately, computing these posteriors a, b, c, and d just results in column vectors of 1's. I'm aware these are degenerate cases (and pointless for actual classification, since we already know the classifications of our training data), but I still wanted to test them to make sure the posterior method works as I expect.
What I expected for each call, versus what actually happened:
a = posterior(gmfit_class_1,X_only_class_1)
% ^ This produces a column vector of 1's, which I thought was fine. After all, the gmfit object was trained on those points
b = posterior(gmfit_class_1,X_only_class_2)
% ^ This one also produces a vector of 1's, which I thought was wrong. It should be a vector of low but nonzero numbers.
c = posterior(gmfit_class_2,X_only_class_1)
% ^ This one also produces a vector of 1's, which I thought was wrong. It should be a vector of low but nonzero numbers.
d = posterior(gmfit_class_2,X_only_class_2)
% ^ This produces a column vector of 1's, which I thought was fine. After all, the gmfit object was trained on those points
I have to think that Matlab is somehow being thrown off by the fact that each of the two fitted models contains only one component. Either that, or I am not interpreting the posterior method correctly.
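In fact, if I write out what I think posterior computes, the single-component case looks degenerate (this is my own reading of the docs, so I may be misinterpreting it). For a mixture whose components have weights pi_j, the posterior for component j given a point x should be

P(component j | x) = pi_j * N(x; mu_j, Sigma_j) / sum_m [ pi_m * N(x; mu_m, Sigma_m) ]

and with only one component (pi_1 = 1), the numerator and denominator are identical, so the result is 1 for every x, no matter where it lies. Is that the right reading? And if so, what should I be calling instead to get the "assigned probability" I need for classification?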