
I'm trying to do clustering with a Gaussian mixture model (gmdistribution). I tried this code:

    opts = statset('MaxIter', 300, 'Display', 'iter');
    gm = gmdistribution.fit(braindata, nsegments, 'Regularize', 1e-6, 'Options', opts);

where braindata is a 1478071 × 11 data matrix (voxels × proteins) and nsegments is 8.

And I got this error:

    Error using gmdistribution.fit (line 136)
    The following column(s) of data are effectively constant: 6 7 8 9 10 11.

    Error in reducedSegbrain_gmix (line 119)
    gm = gmdistribution.fit(braindata, nsegments, 'Regularize', 1e-6, 'Options', opts);

Is there any workaround to this?

Victor
  • You only have the one error there, really; the second `Error` line is part of the call stack telling you where in `reducedSegbrain_gmix` the failing call to `gmdistribution.fit` was found. – xenoclast Dec 10 '14 at 17:20

1 Answer


For me the best option was to discard those columns. That was OK in my application, but it may not be for yours.

Here's the bit of the gmdistribution class definition that checks for that condition and produces the error:

        varX = var(X);
        I = find(varX < eps(max(varX))*n);
        if ~isempty(I)
            error('stats:gmdistribution:ZeroVariance',...
                'The following column(s) of data are effectively constant: %s.', num2str(I));
        end

where X is the multivariate data passed to the fit method and n is its number of rows. The test for 'effectively zero variance' compares each column's variance against eps(max(varX)) * n; eps(x) is the spacing between x and the next larger representable floating-point number, so the threshold scales with both the magnitude of the largest column variance and the number of rows in your data.
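
To see how that spacing scales in practice (values shown are for double precision):

    eps(1)        % ans = 2.2204e-16  (spacing of doubles near 1)
    eps(1e10)     % ans = 1.9073e-06  (spacing of doubles near 1e10)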

So one approach is to reimplement that test yourself and deal with the offending columns before gmdistribution.fit throws the error. If a column's variance is so low that it's considered zero, there's nothing to be gained from including it, so there's no harm in discarding that column and carrying on fitting with the ones that are left.
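
A minimal sketch of that approach, reusing the toolbox's own threshold (everything here except braindata and nsegments is my own naming):

    % Reimplement the zero-variance test and drop the offending columns
    % before fitting.
    X = double(braindata);
    n = size(X, 1);
    varX = var(X);
    keep = varX >= eps(max(varX)) * n;   % same threshold gmdistribution.fit uses
    fprintf('Discarding column(s): %s\n', num2str(find(~keep)));

    opts = statset('MaxIter', 300, 'Display', 'iter');
    gm = gmdistribution.fit(X(:, keep), nsegments, 'Regularize', 1e-6, 'Options', opts);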

From the looks of your example that would be more than half your dataset (6 of the 11 columns). That may not be ideal, but it's not uncommon in multivariate analysis to find that a subset of your variables contains the majority of the variance (cf. Pareto). You could run principal component analysis first to discard some of those dimensions prior to the GMM fit, though the test above is effectively doing that already.
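
If you did want to try PCA first, here's a sketch (this assumes the Statistics Toolbox pca function; the 99% cutoff is an arbitrary choice, not a recommendation):

    % Fit the GMM on the leading principal components instead of the raw columns.
    [~, score, ~, ~, explained] = pca(double(braindata));
    ncomp = find(cumsum(explained) >= 99, 1);   % components covering 99% of variance
    gm = gmdistribution.fit(score(:, 1:ncomp), nsegments, ...
        'Regularize', 1e-6, 'Options', opts);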

If you absolutely have to include those columns then you may be able to do some other processing on them to raise the variance. First I would make sure that the values are being stored in a datatype that has enough precision to represent them, though that's usually handled fairly well automatically by MATLAB.
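
For instance, if the data came off the scanner as integers:

    class(braindata)                 % e.g. 'int16' straight from the scanner
    braindata = double(braindata);   % cast so that tiny variances are representable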

If the variances of these low-variance columns are enough orders of magnitude smaller than the others' (remember that the test above is relative to the eps of the maximum of all the columns' variances), that relative disparity is the problem, and you may be able to reduce it with some judicious normalisation.
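
A sketch of that, standardising each column to unit variance so that no one column's variance dominates the threshold (zscore is from the Statistics Toolbox; the guard against exactly-constant columns is mine):

    % Standardise columns so their variances are comparable; skip any column
    % that is exactly constant (zscore would divide by zero there).
    X = double(braindata);
    nonconst = std(X) > 0;
    X(:, nonconst) = zscore(X(:, nonconst));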

And if all that fails then maybe you have to go back to the acquisition source and improve the SNR. If that's an MRI machine then I wish you the best of luck...

xenoclast