My understanding of "an infinite mixture model with the Dirichlet Process as a prior distribution on the number of clusters" is that the number of clusters is determined by the data, with the model converging to a certain number of clusters.
This R implementation (https://github.com/jacobian1980/ecostates) decides on the number of clusters in this way. The R implementation uses a Gibbs sampler, though, and I'm not sure whether that affects this.
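To make the intuition above concrete, here is a minimal sketch of the Chinese restaurant process view of a Dirichlet Process prior. It is my own illustration, not code from scikit-learn or the R package; the function name `sample_crp` and the concentration value `alpha=1.0` are assumptions chosen for the example. It only simulates the prior over partitions (no data likelihood), to show that the number of clusters is not fixed in advance.

```python
import random

def sample_crp(n_samples, alpha, seed=0):
    """Draw cluster labels for n_samples points from a Chinese restaurant process.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of points currently assigned to cluster k
    for _ in range(n_samples):
        # An existing cluster k is picked with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        labels.append(k)
    return labels, counts

labels, counts = sample_crp(n_samples=200, alpha=1.0)
print(len(counts), "clusters for 200 samples:", counts)
```

The number of clusters that comes out depends on `alpha` and on the random draws, which is what I mean by the cluster count not being a fixed input.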
What confuses me is the n_components parameter. From the docs:
n_components : int, default 1
    Number of mixture components.
If the number of components is determined by the data and the Dirichlet Process, then what is this parameter?
Ultimately, I'm trying to get:
(1) the cluster assignment for each sample;
(2) the probability vectors for each cluster; and
(3) the likelihood/log-likelihood for each sample.
It looks like (1) is the predict method and (3) is the score method. However, the output of (1) is completely dependent on the n_components hyperparameter.
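For reference, this is a rough sketch of the three outputs I am after. It assumes a recent scikit-learn, where DPGMM has been removed and BayesianGaussianMixture with weight_concentration_prior_type="dirichlet_process" is the closest replacement. The synthetic two-blob data, n_components=10, and the variable names are all assumptions for illustration, and reading (2) as the per-sample membership probabilities from predict_proba is my best guess.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic stand-in for the real table: two well-separated 2-D blobs.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),
    rng.normal(8.0, 1.0, size=(100, 2)),
])

# n_components=10 is an arbitrary value chosen for this illustration.
model = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
)
model.fit(X)

labels = model.predict(X)          # (1) hard cluster assignment per sample
resp = model.predict_proba(X)      # (2) membership probabilities, one row per sample
log_lik = model.score_samples(X)   # (3) log-likelihood of each sample
weights = model.weights_           # fitted mixture weight of each component

print(np.unique(labels))
print(np.round(weights, 3))
```

Note that in current scikit-learn, score returns the average log-likelihood over all samples, so score_samples is the per-sample version.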
My apologies if this is a naive question. I'm very new to Bayesian programming and noticed there was a Dirichlet Process implementation in scikit-learn that I wanted to try out.
Here's the docs: http://scikit-learn.org/stable/modules/generated/sklearn.mixture.DPGMM.html#sklearn.mixture.DPGMM
Here's an example of usage: http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html
Here's my naive usage:
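import pandas as pd
from sklearn.mixture import DPGMM

# Load the tab-separated table; the first column is used as the sample index.
X = pd.read_table("Data/processed/data.tsv", sep="\t", index_col=0)
# Fit the Dirichlet Process GMM with n_components=3.
Mod_dpgmm = DPGMM(n_components=3)
Mod_dpgmm.fit(X)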