I want to fit a Gaussian mixture model to a dataset of about 120k samples, each with about 130 dimensions. In MATLAB I run the following script (with 1000 clusters):

gm = fitgmdist(data, 1000, 'Options', statset('Display', 'iter'), 'RegularizationValue', 0.01);

I get the following outputs:

  iter      log-likelihood
   1    -6.66298e+07
   2    -1.87763e+07
   3    -5.00384e+06
   4    -1.11863e+06
   5          299767
   6          985834
   7     1.39525e+06
   8     1.70956e+06
   9     1.94637e+06

The log-likelihood is greater than 0! That seems unreasonable to me, and I don't know why.

Could somebody help me?

徐珍琦

1 Answer

First of all, this is not a problem of how large your dataset is. Here is some code that produces similar results with quite a small dataset:

options = statset('Display', 'iter');
x = ones(5,2) + (rand(5,2)-0.5)/1000;
fitgmdist(x,1,'Options',options);

this produces

iter     log-likelihood
 1       64.4731
 2       73.4987
 3       73.4987

Of course you know that the log function (the natural logarithm) has a range from -inf to +inf. I guess your problem is that you expect the input to the log (i.e. the likelihood) to be bounded by [0,1]. But the likelihood is built from pdf values, and a pdf can take very large values on a very dense dataset.

PDFs must be non-negative (which is why we can take their log wherever they are positive) and must integrate to 1, but their values are not bounded by [0,1].
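To see why a positive log-likelihood is perfectly possible, consider a univariate Gaussian with a small standard deviation. Here is a quick sketch in Python (the function name gauss_pdf is just for illustration; the math is the same in any language):

```python
from math import exp, log, pi, sqrt

def gauss_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# For a narrow Gaussian the peak density is far above 1 ...
peak = gauss_pdf(0.0, 0.0, 0.01)
print(peak)       # ~39.89

# ... so the log of the density (a per-sample log-likelihood) is positive.
print(log(peak))  # ~3.69
```

The peak density of N(mu, sigma^2) is 1/(sigma*sqrt(2*pi)), so it exceeds 1 as soon as sigma < 1/sqrt(2*pi) ≈ 0.4.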

You can verify this by reducing the density in the code above:

x = ones(5,2) + (rand(5,2)-0.5)/1;
fitgmdist(x,1,'Options',options);

this produces

iter     log-likelihood
 1      -8.99083
 2      -3.06465
 3      -3.06465

So, I would rather assume that your dataset contains several duplicate (or very close) values.
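If you want to reproduce the experiment outside MATLAB, here is a sketch in NumPy (my own translation, not the original fitgmdist call) that fits a single Gaussian by maximum likelihood and reports the average log-likelihood per sample:

```python
import numpy as np

def avg_loglik(x):
    """Average log-likelihood per sample of the ML Gaussian fit to x."""
    n, d = x.shape
    mu = x.mean(axis=0)
    sigma = np.cov(x, rowvar=False, bias=True)  # ML covariance estimate
    diff = x - mu
    # Per-sample Mahalanobis distance diff_i^T * inv(sigma) * diff_i
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(sigma), diff)
    logdet = np.linalg.slogdet(sigma)[1]
    return np.mean(-0.5 * (d * np.log(2 * np.pi) + logdet + quad))

rng = np.random.default_rng(0)

# Five nearly identical samples (a very dense cloud), as in the answer ...
x_dense = 1.0 + (rng.random((5, 2)) - 0.5) / 1000
# ... and the same construction spread over a unit range.
x_spread = 1.0 + (rng.random((5, 2)) - 0.5) / 1

print(avg_loglik(x_dense))   # large and positive: densities far above 1
print(avg_loglik(x_spread))  # much smaller, as in the second MATLAB run
```

The tight cluster yields a tiny covariance, hence a very negative log-determinant and a positive log-likelihood; spreading the samples out brings the value back down.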

Pedia