
My understanding of "an infinite mixture model with the Dirichlet Process as a prior distribution on the number of clusters" is that the number of clusters is not fixed in advance but inferred from the data, with the inference converging to some number of clusters.

This R implementation (https://github.com/jacobian1980/ecostates) decides on the number of clusters in this way. Although the R implementation uses a Gibbs sampler, I'm not sure whether that affects this.

What confuses me is the n_components parameter. The docs say: n_components : int, default 1 : Number of mixture components. If the number of components is determined by the data and the Dirichlet Process, then what is this parameter?


Ultimately, I'm trying to get:

(1) the cluster assignment for each sample;

(2) the probability vectors for each cluster; and

(3) the likelihood/log-likelihood for each sample.

It looks like (1) is the predict method, and (3) is the score method. However, the output of (1) is completely dependent on the n_components hyperparameter.

My apologies if this is a naive question; I'm very new to Bayesian programming and noticed there was a Dirichlet Process implementation in scikit-learn that I wanted to try out.


Here's the docs: http://scikit-learn.org/stable/modules/generated/sklearn.mixture.DPGMM.html#sklearn.mixture.DPGMM

Here's an example of usage: http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html

Here's my naive usage:

import pandas as pd
from sklearn.mixture import DPGMM

# Load samples (rows) x features (columns), then fit the mixture
X = pd.read_table("Data/processed/data.tsv", sep="\t", index_col=0)
Mod_dpgmm = DPGMM(n_components=3)
Mod_dpgmm.fit(X)
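And here's roughly what I'd hope to pull out of the fitted model, as a sketch (assuming the old 0.17 mixture API, where predict gives hard assignments, means_ holds the per-cluster parameter vectors, and score returns per-sample log probabilities):

labels = Mod_dpgmm.predict(X)    # (1) cluster assignment for each sample
params = Mod_dpgmm.means_        # (2) fitted parameter vector per cluster
logprob = Mod_dpgmm.score(X)     # (3) log-likelihood for each sample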
O.rka
  • don't really know about this kind of model, but in the doco they call `n_components` a "truncation parameter", so I guess the number of components is determined by the data, but you have to specify an upper bound. – maxymoo Aug 22 '16 at 23:01
  • Oh, it's an upper bound? I tried setting it to the max number of samples I have (42) and I ended up w/ 42 clusters. I think it might be forcing them into that number of clusters. When I did Gibbs sampling w/ the R implementation listed above for 2000 iterations, I got 3 clusters. – O.rka Aug 22 '16 at 23:08
  • 1
    not sure, maybe have a play around with some of the other paramaters like `convariance_type`, `alpha` etc? – maxymoo Aug 22 '16 at 23:18
  • @maxymoo i'm going to mess around w/ it today and let you know. thanks for the suggestions. – O.rka Aug 23 '16 at 15:53

2 Answers


As mentioned by @maxymoo in the comments, n_components is a truncation parameter.

In the context of the Chinese Restaurant Process, which is related to the stick-breaking representation used in sklearn's DP-GMM, a new data point joins an existing cluster k with probability |k| / (n - 1 + alpha) and starts a new cluster with probability alpha / (n - 1 + alpha), where |k| is the size of cluster k and n is the number of data points seen so far. Here alpha is the concentration parameter of the Dirichlet Process, and it influences the final number of clusters.
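A minimal simulation of that process (plain numpy, nothing from sklearn; the helper name crp and the settings below are illustrative) shows how alpha drives the number of clusters:

import numpy as np

def crp(n_samples, alpha, rng):
    """Simulate a Chinese Restaurant Process; return the cluster sizes."""
    counts = []                              # counts[k] = |k|, the size of cluster k
    for n in range(n_samples):
        # n points are already seated, so the next point joins cluster k with
        # prob counts[k] / (n + alpha) and opens a new one with prob alpha / (n + alpha)
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                 # new cluster
        else:
            counts[k] += 1
    return counts

rng = np.random.default_rng(0)
for alpha in (0.1, 1.0, 10.0):
    sizes = crp(1000, alpha, rng)
    print("alpha=%s -> %d clusters" % (alpha, len(sizes)))

Small alpha concentrates the data in a few large clusters, while large alpha keeps opening new ones.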

Unlike the R implementation, which uses Gibbs sampling, sklearn's DP-GMM uses variational inference. This may account for the difference in results.

A gentle Dirichlet Process tutorial can be found here.

rafaelvalle

The DPGMM class is now deprecated, as the warning shows:

DeprecationWarning: Class DPGMM is deprecated; The DPGMM class is not working correctly and it's better to use sklearn.mixture.BayesianGaussianMixture class with parameter weight_concentration_prior_type='dirichlet_process' instead. DPGMM is deprecated in 0.18 and will be removed in 0.20.
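For anyone on a newer sklearn, here is a minimal sketch of the replacement (the toy blobs stand in for the question's data table; n_components is still the truncation level, i.e. an upper bound):

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Toy data: three well-separated blobs in 2-D
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

# n_components is an upper bound (truncation); the DP prior can switch
# extra components off by driving their weights toward zero
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,   # alpha: smaller -> fewer effective clusters
    max_iter=500,
    random_state=0,
).fit(X)

labels = dpgmm.predict(X)           # (1) cluster assignment per sample
params = dpgmm.means_               # (2) per-cluster mean vectors
loglik = dpgmm.score_samples(X)     # (3) per-sample log-likelihood
print(np.round(dpgmm.weights_, 3))  # most of the 10 weights collapse near 0

With the Dirichlet Process prior, only the effective clusters keep appreciable weight, which is how "the number of clusters is determined by the data" shows up in the variational implementation.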

Xiang Zhang