
I've been using Python to experiment with sklearn's BayesianGaussianMixture (and with GaussianMixture, which shows the same issue).

I fit the model with a number of items drawn from a distribution, then tested the model with a held out data set (some from the distribution, some outside it).

Something like:

from sklearn.mixture import BayesianGaussianMixture

X_train = ...  # 70x321 matrix of training data
X_in = ...     # 20x321 matrix of held-out data points from the same distribution
X_out = ...    # 20x321 matrix of data points drawn from a different distribution

model = BayesianGaussianMixture(n_components=1)
model.fit(X_train)

print(model.score_samples(X_in).mean())
print(model.score_samples(X_out).mean())

outputs:

-1334380148.57
-2953544628.45

The score_samples method returns a per-sample log likelihood of the given data, and as expected the "in" samples are much more likely than the "out" samples. I'm just wondering why the absolute values are so large.

The documentation for score_samples states "Compute the weighted log probabilities for each sample" - but I'm unclear what the weights are based on.

Do I need to scale my input first? Is my input dimensionality too high? Do I need to do some additional parameter tuning? Or am I just misunderstanding what the method returns?


1 Answer


The "weights" are the mixture component weights: for each sample, score_samples returns the log of the mixture density, i.e. the log of the sum over components of (component weight × Gaussian density). Since you fit with n_components=1, there is a single component with weight 1, so the weighting changes nothing here.
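To make that concrete, here is a minimal sketch that reproduces score_samples by hand for a fitted GaussianMixture with the default covariance_type='full' (BayesianGaussianMixture applies extra variational corrections on top of this, so its values will not match exactly):

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def manual_score_samples(model, X):
    # One column of per-component log densities per sample.
    log_dens = np.column_stack([
        multivariate_normal.logpdf(X, mean=m, cov=c)
        for m, c in zip(model.means_, model.covariances_)
    ])
    # Add the log mixture weights and sum the components in log space.
    return logsumexp(log_dens + np.log(model.weights_), axis=1)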

Do I need to scale my input first?

Scaling is usually not a bad idea, but I can't say for sure without knowing more about your data.
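For example (a sketch reusing the variable names from your question), you could standardize the features on the training set and apply the same scaler to the held-out sets:

from sklearn.preprocessing import StandardScaler
from sklearn.mixture import BayesianGaussianMixture

# Fit the scaler on the training data only, then reuse it everywhere.
scaler = StandardScaler().fit(X_train)

model = BayesianGaussianMixture(n_components=1)
model.fit(scaler.transform(X_train))

print(model.score_samples(scaler.transform(X_in)).mean())
print(model.score_samples(scaler.transform(X_out)).mean())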

Is my input dimensionality too high?

Given the amount of data you are fitting, it probably is. Remember the curse of dimensionality: you have only 70 training rows and 321 features, more than four features per sample, and a single full covariance matrix in 321 dimensions already has 321 × 322 / 2 = 51,681 free parameters to estimate. That's not really going to work in practice. One option is to reduce the dimensionality before fitting, as sketched below.
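This sketch reuses your variable names and picks 20 principal components purely as an illustrative guess; the right number depends on your data:

from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture

# Project onto far fewer dimensions than you have training rows.
pca = PCA(n_components=20).fit(X_train)

model = BayesianGaussianMixture(n_components=1)
model.fit(pca.transform(X_train))

print(model.score_samples(pca.transform(X_in)).mean())
print(model.score_samples(pca.transform(X_out)).mean())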

Do I need to do some additional parameter tuning? Or am I just misunderstanding what the method returns?

Your outputs are log-probabilities that are very negative. If you exponentiate such a large negative number, you get a density that is essentially zero, so your results actually make sense from that perspective. You may want to check the log-probability in regions where you know the density of that component should be high. You may also want to check the covariance of each component to make sure you don't have a degenerate solution, which is quite likely given the amount of data and the dimensionality in this case. Before any of that, though, you may want to get more data or see if you can reduce the number of dimensions.
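One way to check for a degenerate covariance (a sketch assuming the default covariance_type='full') is to look at its eigenvalue spread; a near-zero smallest eigenvalue means the fitted Gaussian has collapsed along some direction, which makes the log-densities extreme:

import numpy as np

# model.covariances_ has shape (n_components, n_features, n_features) for 'full'.
for k, cov in enumerate(model.covariances_):
    eigvals = np.linalg.eigvalsh(cov)
    print(f"component {k}: min eigenvalue {eigvals.min():.3e}, "
          f"condition number {eigvals.max() / eigvals.min():.3e}")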

I forgot to mention a rather important point: the output is the log of a probability density, not of a probability, so keep that in mind too. Densities are not bounded by 1, and in 321 dimensions their magnitudes can be enormous in either direction.
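To see why the raw magnitudes are so large, here's a small synthetic illustration (not your data): even a 321-dimensional standard normal evaluated at its own mean has a log-density of about -295, and points a few standard deviations away drop off very quickly:

import numpy as np
from scipy.stats import multivariate_normal

d = 321
mvn = multivariate_normal(mean=np.zeros(d), cov=np.eye(d))

print(mvn.logpdf(np.zeros(d)))      # about -295: -0.5 * d * log(2*pi)
print(mvn.logpdf(np.full(d, 3.0)))  # about -295 - 0.5 * d * 9 = -1739.5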
