I'm working on a time series dataset, so when fitting scikit-learn's `GaussianMixture`, I need the features (timestamps) to be treated as dependent on one another. However, after examining the source code, I can't find a parameter for customizing the covariance matrix.

With my limited statistics knowledge, I'm curious how I can modify the covariance matrix during the M-step (where the covariances are re-estimated) to incorporate time dependency into the GMM. Thank you very much.

Here is the source code. The change I want to make is in the `_estimate_gaussian_parameters()` function: https://github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/mixture/gaussian_mixture.py#L435

TylerH
Cocoa Wang
  • I'm not sure if you should be modifying the source code of sklearn directly. Is this what you're trying to achieve? https://stats.stackexchange.com/questions/152002/mixture-model-with-dependant-observations – darksky Feb 18 '19 at 07:28
  • The problem is the same problem I have, but rather than introducing an autoregressive property, I would like to make it explicit in the covariance matrix, in other words, the covariance matrix shouldn't be diagonal, but I'm not sure how – Cocoa Wang Feb 18 '19 at 16:56
  • From their [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html), `GaussianMixture()` has a parameter called `covariance_type`, which takes values `'full' (default), 'tied', 'diag', and 'spherical'`. See the link for more details. – darksky Feb 19 '19 at 00:36

1 Answer

With darksky's help, I learned that `GaussianMixture` already has a built-in option for the covariance structure. The `covariance_type` parameter has four options: 'full' (each component has its own general covariance matrix), 'tied' (all components share the same general covariance matrix), 'diag' (each component has its own diagonal covariance matrix), and 'spherical' (each component has its own single variance).
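As a quick check of what each option stores, here is a minimal sketch that fits the same data with all four settings and prints the shape of the fitted `covariances_` attribute. The random data and the names `X` and `shapes` are made up for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data (an assumption for illustration): 200 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

shapes = {}
for cov_type in ("full", "tied", "diag", "spherical"):
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type, random_state=0)
    gmm.fit(X)
    # The layout of covariances_ depends on covariance_type.
    shapes[cov_type] = gmm.covariances_.shape
    print(cov_type, shapes[cov_type])
```

Only 'full' and 'tied' store complete feature-by-feature matrices, so only they have off-diagonal entries that can represent dependence between features.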

As I understand it, 'spherical' assumes that within each component the features are independent and share a single variance, while 'diag' assumes the features are independent but gives each one its own variance. Therefore, one should use either 'full' or 'tied' when the features of a multivariate dataset are dependent.
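To see the difference on dependent features, the sketch below draws two correlated features (standing in for two consecutive timestamps) and fits a single-component model with 'full' and with 'diag'. The data, the 0.8 correlation, and the variable names are assumptions chosen for illustration, not part of the original problem.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (assumed): two features with covariance 0.8 between them.
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.8],
                     [0.8, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=2000)

full = GaussianMixture(n_components=1, covariance_type="full", random_state=0).fit(X)
diag = GaussianMixture(n_components=1, covariance_type="diag", random_state=0).fit(X)

# 'full' estimates the off-diagonal term; 'diag' only stores per-feature variances.
off_diag = full.covariances_[0][0, 1]
print("estimated covariance between features:", off_diag)
print("diag variances:", diag.covariances_[0])

# Average per-sample log-likelihood on the training data.
print("full score:", full.score(X))
print("diag score:", diag.score(X))
```

On data like this, the 'full' model recovers a covariance close to the one used to generate the samples and fits the data better, because the 'diag' model has no way to represent the dependence.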

Cocoa Wang