
I'm trying to classify a testset using a GMM. I have a trainset (an n*4 matrix, where n is the number of training examples and each example has 4 properties) with labels {1,2,3}, and a testset (m*4) to be classified.

My goal is to get a probability matrix (m*3) that gives, for each test example, the probability of each label, P(label | x_test), just like soft clustering.

First, I create a GMM with k = 9 components over the whole trainset. I know that in some papers the authors create a separate GMM for each label, but I want to model the data from all of the classes together.

    GMModel = fitgmdist(trainset,k_component,'RegularizationValue',0.1,'Start','plus');
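
As an aside, here is a minimal sketch of how I could search for the number of components by BIC instead of fixing k = 9 (the 1:12 range is an arbitrary choice):

    candidate_k = 1:12;                    % arbitrary search range
    bic = zeros(size(candidate_k));
    for i = 1:numel(candidate_k)
        mdl = fitgmdist(trainset, candidate_k(i), ...
            'RegularizationValue', 0.1, 'Start', 'plus');
        bic(i) = mdl.BIC;                  % Bayesian information criterion
    end
    [~, i_best] = min(bic);
    k_component = candidate_k(i_best);     % pick the k with the lowest BIC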

My problem is that I want to estimate the relationship P(component | label) between components and labels. So I wrote the code below, but I am not sure whether it is right:

    idx_ex_of_c1 = find(trainset_label==1); % indices of label-1 examples
    idx_ex_of_c2 = find(trainset_label==2);
    idx_ex_of_c3 = find(trainset_label==3);

    % Posterior probability of each component for every training example
    [~,~,post] = cluster(GMModel,trainset);
    % cita_c_k(c,k): average posterior of component k among label-c examples
    cita_c_k = zeros(3,k_component);
    for id_k = 1:k_component
        cita_c_k(1,id_k) = sum(post(idx_ex_of_c1,id_k))/numel(idx_ex_of_c1);
        cita_c_k(2,id_k) = sum(post(idx_ex_of_c2,id_k))/numel(idx_ex_of_c2);
        cita_c_k(3,id_k) = sum(post(idx_ex_of_c3,id_k))/numel(idx_ex_of_c3);
    end

cita_c_k is a 3*9 matrix that stores these relationships: cita_c_k(c,k) is the average posterior probability of component k over the training examples of label c, i.e. my estimate of P(component k | label c). idx_ex_of_c1 holds the indices of the examples whose label is 1 in the trainset.

For the testing process, I first apply the GMModel to the testset:

    [P,~] = posterior(GMModel,testset); % P is an m*9 matrix of component posteriors

Then I sum over all components:

    P_testset = P*cita_c_k';      % m*3: component posteriors mixed per label
    [~,b] = max(P_testset,[],2);  % most likely label for each test example
    imagesc(b);
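
To spell out the matrix product: it computes P_testset(i,c) = sum_k P(k | x_test_i) * cita_c_k(c,k), i.e. each test example's component posteriors weighted by the per-label component profile estimated on the trainset; the max along the second dimension then picks the most likely of the 3 labels.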

The result is OK, but not good enough. Can anyone give me some tips?

Thanks!

  • why did you choose 9 components for your GMM? More is not necessarily better, in fact I've seen extremely complex training sets using only 5-7 components. – GameOfThrows May 23 '16 at 15:16
  • I'm not sure how many components perform best. Maybe you're right; I'm going to look for the optimal number of components. – Zhiyu Huang May 28 '16 at 20:50

1 Answer


You can take the following steps:

  1. Increase the target error and/or use an optimal network size in training, although over-training and increasing the network size usually won't help.

  2. Most importantly, shuffle the training data while training, and train only on the data points that are clearly representative of a label (ignore data points that may belong to more than one label). A minimal sketch of the shuffling follows this list.
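
A sketch of the shuffling step, assuming the trainset and trainset_label variables from the question:

    idx = randperm(size(trainset, 1));    % random permutation of the examples
    trainset = trainset(idx, :);          % shuffle examples ...
    trainset_label = trainset_label(idx); % ... and labels with the same order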

SEPARABILITY

Verify the separability of the data's properties using correlation:

  1. The correlation of all data within a label (X) should be high (near one).
  2. The cross-correlation of all data in label X with data in any other label (!= X) should be low (near zero); one way to check this is sketched below.
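
A minimal sketch of such a check, assuming the trainset and trainset_label variables from the question (shown for labels 1 and 2; rows are examples, so the matrices are transposed because corr correlates columns):

    X1 = trainset(trainset_label==1, :);   % examples of label 1
    X2 = trainset(trainset_label==2, :);   % examples of label 2

    C11 = corr(X1');                       % example-vs-example, within label 1
    mask = ~eye(size(C11,1));              % drop the self-correlations
    within1 = mean(C11(mask));             % should be near one

    C12 = corr(X1', X2');                  % example-vs-example, across labels
    between12 = mean(C12(:));              % should be near zero

    fprintf('within label 1: %.2f, between 1 and 2: %.2f\n', within1, between12);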

If you observe that data points within a label have low correlation while data points across labels have high correlation, it calls the selection of properties into question (some properties may not actually make the data separable). In that case, do the following:

  1. Add more relevant properties to the data points and remove less relevant ones (PCA is one technique for this; see the sketch after this list).
  2. Train on derived parameters, such as the top frequency components, rather than on the raw data points.
  3. For time series, always use a time-delay network.
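
A minimal sketch of the PCA route, assuming the trainset and testset from the question; the 95% variance threshold is an arbitrary choice:

    % Keep enough principal components to explain ~95% of the variance
    [coeff, score, ~, ~, explained, mu] = pca(trainset);
    n_keep = find(cumsum(explained) >= 95, 1);
    trainset_reduced = score(:, 1:n_keep);
    % Project the testset onto the same basis, centered on the training mean
    testset_reduced = bsxfun(@minus, testset, mu) * coeff(:, 1:n_keep);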
SACn