
I would like to estimate the parameters of a mixture model of normal distributions in OpenTURNS (that is, a distribution whose density is a weighted sum of Gaussian densities). OpenTURNS can create such a mixture, but it cannot estimate its parameters. Moreover, I need to create the mixture as an OpenTURNS distribution in order to propagate uncertainty through a function.

For example, I know how to create a mixture of two normal distributions:

import openturns as ot
mu1 = 1.0
sigma1 = 0.5
mu2 = 3.0
sigma2 = 2.0
weights = [0.3, 0.7]
n1 = ot.Normal(mu1, sigma1)
n2 = ot.Normal(mu2, sigma2)
m = ot.Mixture([n1, n2], weights)

In this example, I would like to estimate mu1, sigma1, mu2, sigma2 from a given sample. To create a working example, it is easy to generate a sample by simulation.

s = m.getSample(100)

4 Answers


You can rely on scikit-learn's GaussianMixture to estimate the parameters and then use them to define a Mixture model in OpenTURNS.
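
For the one-dimensional case of the question, the idea boils down to a few lines. This is a minimal sketch of mine, assuming the data live in the OpenTURNS sample s from the question and using the default 'full' covariance type; the variable names components and estimated are my own:

import numpy as np
import openturns as ot
from sklearn.mixture import GaussianMixture

# Fit a 2-component Gaussian mixture on the 1-D sample
gmm = GaussianMixture(n_components=2).fit(np.array(s))
# Turn each (mean, covariance) pair into a 1-D Normal distribution
components = [ot.Normal(float(mean[0]), float(np.sqrt(cov[0, 0])))
              for mean, cov in zip(gmm.means_, gmm.covariances_)]
# Rebuild the mixture as an OpenTURNS distribution
estimated = ot.Mixture(components, gmm.weights_.tolist())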

The script below contains a Python class MixtureFactory that estimates the parameters of a scikit-learn GaussianMixture and outputs an OpenTURNS Mixture distribution:

from sklearn.mixture import GaussianMixture
from sklearn.utils.validation import check_is_fitted
import openturns as ot
import numpy as np

class MixtureFactory(GaussianMixture):
    """
    Representation of a Gaussian mixture model probability distribution.

    This class estimates the parameters of a Gaussian mixture
    distribution using scikit-learn algorithms and provides an OpenTURNS Mixture object.

    Read more in the scikit-learn user guide and the OpenTURNS documentation.

    Parameters:
    -----------
    n_components : int, defaults to 1.
        The number of mixture components.

    covariance_type : {'full' (default), 'tied', 'diag', 'spherical'}
        String describing the type of covariance parameters to use.
        Must be one of:

        'full'
            each component has its own general covariance matrix
        'tied'
            all components share the same general covariance matrix
        'diag'
            each component has its own diagonal covariance matrix
        'spherical'
            each component has its own single variance

    tol : float, defaults to 1e-3.
        The convergence threshold. EM iterations will stop when the
        lower bound average gain is below this threshold.

    reg_covar : float, defaults to 1e-6.
        Non-negative regularization added to the diagonal of covariance.
        Ensures that the covariance matrices are all positive.

    max_iter : int, defaults to 100.
        The number of EM iterations to perform.

    n_init : int, defaults to 1.
        The number of initializations to perform. The best results are kept.

    init_params : {'kmeans', 'random'}, defaults to 'kmeans'.
        The method used to initialize the weights, the means and the
        precisions.
        Must be one of::

            'kmeans' : responsibilities are initialized using kmeans.
            'random' : responsibilities are initialized randomly.

    weights_init : array-like, shape (n_components, ), optional
        The user-provided initial weights, defaults to None.
        If it is None, weights are initialized using the `init_params` method.

    means_init : array-like, shape (n_components, n_features), optional
        The user-provided initial means, defaults to None,
        If it is None, means are initialized using the `init_params` method.

    precisions_init : array-like, optional.
        The user-provided initial precisions (inverse of the covariance
        matrices), defaults to None.
        If it is None, precisions are initialized using the 'init_params' method.
        The shape depends on 'covariance_type'::

            (n_components,)                        if 'spherical',
            (n_features, n_features)               if 'tied',
            (n_components, n_features)             if 'diag',
            (n_components, n_features, n_features) if 'full'

    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

    warm_start : bool, default to False.
        If 'warm_start' is True, the solution of the last fitting is used as
        initialization for the next call of fit(). This can speed up
        convergence when fit is called several times on similar problems.
        In that case, 'n_init' is ignored and only a single initialization
        occurs upon the first call.
        See :term:`the Glossary <warm_start>`.

    verbose : int, default to 0.
        Enable verbose output. If 1 then it prints the current
        initialization and each iteration step. If greater than 1 then
        it prints also the log probability and the time needed
        for each step.

    verbose_interval : int, default to 10.
        Number of iterations done before the next print.
    """
    def __init__(self, n_components=2, covariance_type='full', tol=1e-6,
                 reg_covar=1e-6, max_iter=1000, n_init=1, init_params='kmeans',
                 weights_init=None, means_init=None, precisions_init=None,
                 random_state=41, warm_start=False,
                 verbose=0, verbose_interval=10):
        # Pass the arguments by keyword: recent scikit-learn releases make most
        # estimator parameters keyword-only.
        super().__init__(n_components=n_components, covariance_type=covariance_type,
                         tol=tol, reg_covar=reg_covar, max_iter=max_iter, n_init=n_init,
                         init_params=init_params, weights_init=weights_init,
                         means_init=means_init, precisions_init=precisions_init,
                         random_state=random_state, warm_start=warm_start,
                         verbose=verbose, verbose_interval=verbose_interval)

    def fit(self, X):
        """
        Fit the mixture model parameters.

        EM algorithm is applied here to estimate the model parameters and build a
        Mixture distribution (see openturns mixture). 
        The method fits the model ``n_init`` times and sets the parameters with
        which the model has the largest likelihood or lower bound. Within each
        trial, the method iterates between E-step and M-step for ``max_iter``
        times until the change of likelihood or lower bound is less than
        ``tol``, otherwise, a ``ConvergenceWarning`` is raised.
        If ``warm_start`` is ``True``, then ``n_init`` is ignored and a single
        initialization is performed upon the first call. Upon consecutive
        calls, training starts where it left off.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            List of n_features-dimensional data points. Each row
            corresponds to a single data point.

        Returns
        -------
        self : MixtureFactory
            The fitted estimator.
        """
        data = np.array(X)
        # Evaluate the model parameters.
        super().fit(data)
        # Build the OpenTURNS Mixture from the fitted scikit-learn parameters.
        # weights_ has length n_components
        weights = self.weights_
        n_components = len(weights)
        # Create ot distribution
        collection = n_components * [0]
        # Covariance matrices
        cov = self.covariances_
        mu = self.means_
        # means : n_components x n_features
        n_components, n_features = mu.shape

        # Depending on the covariance type, define the collection of Gaussian components.

        # Spherical: C_k = sigma_k^2 * Identity
        if self.covariance_type == 'spherical':
            c = ot.CorrelationMatrix(n_features)
            for l in range(n_components):
                sigma = np.sqrt(cov[l])
                collection[l] = ot.Normal(list(mu[l]), [float(sigma)] * n_features, c)

        # Diagonal: each component has its own diagonal covariance matrix
        elif self.covariance_type == 'diag':
            for l in range(n_components):
                c = ot.CovarianceMatrix(n_features)
                for i in range(n_features):
                    c[i, i] = cov[l, i]
                collection[l] = ot.Normal(list(mu[l]), c)

        elif self.covariance_type == 'tied':
            # Same covariance for all clusters
            c = ot.CovarianceMatrix(n_features)
            for i in range(n_features):
                for j in range(0, i+1):
                    c[i,j] = cov[i,j]
            # Define the collection with the same covariance
            for l in range(n_components):
                collection[l] = ot.Normal(list(mu[l]), c)
            
        # Full (default): each component has its own full covariance matrix
        else:
            n_features = cov.shape[1]
            for l in range(n_components):
                c = ot.CovarianceMatrix(n_features)
                for i in range(n_features):
                    for j in range(0, i+1):
                        c[i,j] = cov[l][i,j]
                collection[l] = ot.Normal(list(mu[l]), c)

        self._mixture = ot.Mixture(collection, weights)
        return self

    def get_mixture(self):
        """
        Returns the Mixture object
        """
        check_is_fitted(self)
        return self._mixture

if __name__ == "__main__":
    mu1 = 1.0
    sigma1 = 0.5
    mu2 = 3.0
    sigma2 = 2.0
    weights = [0.3, 0.7]
    n1 = ot.Normal(mu1, sigma1)
    n2 = ot.Normal(mu2, sigma2)
    m = ot.Mixture([n1, n2], weights)
    x = m.getSample(1000)
    est_dist = MixtureFactory(random_state=1)
    est_dist.fit(x)
    print(est_dist.get_mixture())
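
    # Since the question mentions propagating uncertainty through a function,
    # the fitted mixture can be plugged into the usual OpenTURNS machinery.
    # Hedged sketch continuing the block above; the symbolic function x^2 is
    # just a placeholder of mine, not part of the original answer.
    fitted = est_dist.get_mixture()
    g = ot.SymbolicFunction(["x"], ["x^2"])                # placeholder model
    Y = ot.CompositeRandomVector(g, ot.RandomVector(fitted))
    print(Y.getSample(1000).computeMean())                 # Monte Carlo mean of g(X)
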
– josephmure

I have actually tried this method and it works perfectly. Moreover, fitting the model with the scikit-learn GMM and then building the OpenTURNS Mixture is very fast. I recommend that future users test several numbers of components and covariance matrix structures, as it does not take much time and can substantially improve the goodness of fit to the data.
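
For instance, such a model selection can be sketched with scikit-learn's BIC criterion. This is my own illustration of the suggestion above, not code from the accepted answer; it assumes the data come from the question's sample s:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array(s)  # the OpenTURNS sample from the question, as a NumPy array
# Fit every combination of component count and covariance structure
candidates = [GaussianMixture(n_components=k, covariance_type=ct, random_state=0).fit(X)
              for k in (1, 2, 3)
              for ct in ("full", "tied", "diag", "spherical")]
# Keep the model with the lowest Bayesian Information Criterion
best = min(candidates, key=lambda gmm: gmm.bic(X))
print(best.n_components, best.covariance_type)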

Thanks for the answer.

  • The best way to say thanks for an answer is by accepting it or upvoting it. If you do feel the need to comment on an answer, then consider commenting instead of posting a new answer. – Andy Mar 23 '20 at 10:36
  • I know, but I've just created my profile and apparently I must have a reputation of 50 to comment on other people's answers, and 15 to upvote any post or answer. Since I am blocked by these limitations, I preferred to write an answer with some information that may be interesting for future readers, and to take the opportunity to thank the first user. Anyhow, thank you for your comment; I will make sure to follow it as soon as I can. – Alvaro Rollon de Pinedo Mar 23 '20 at 14:31

Here is a pure OpenTURNS solution. It is probably slower than the scikit-learn-based method, but it is more generic: you could use it to estimate the parameters of any mixture model, not necessarily a mixture of normal distributions.

The idea is to retrieve the log-likelihood function from the Mixture object and minimize it. In the following, let us assume that s is the sample we want to fit the mixture on.

First, we need to build the mixture whose parameters we want to estimate. Any valid set of parameters will do. In your example, you want a mixture of 2 normal distributions.

mixture = ot.Mixture([ot.Normal()]*2, [0.5]*2)

There is a small hurdle: the weights sum to 1, so one of them is determined by the others and the solver must not be allowed to set it freely. The order of the parameters of an OpenTURNS Mixture is as follows:

  1. weight of the first distribution;
  2. parameters of the first distribution;
  3. weight of the second distribution;
  4. parameters of the second distribution;
  5. ...

You can view all parameters with mixture.getParameter() and their names with mixture.getParameterDescription(). The following is a helper function that:

  • takes as input a list containing all mixture parameters except the weight of the first distribution;
  • outputs a Point containing all parameters including the weight of the first distribution.
def full(params):
    """
    Build the Point of all mixture parameters from a list that omits the first weight.
    """
    params = ot.Point(params)
    aux_mixture = ot.Mixture(mixture)
    dist_number = aux_mixture.getDistributionCollection().getSize()

    index = aux_mixture.getDistributionCollection()[0].getParameter().getSize()
    list_weights = []
    for num in range(1, dist_number):
        list_weights.append(params[index])
        index += 1 + aux_mixture.getDistributionCollection()[num].getParameter().getSize()

    complementary_weight = ot.Point([abs(1.0 - sum(list_weights))])
    complementary_weight.add(params)
    return complementary_weight
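
As a quick sanity check (my own illustration, relying on the same Point slicing the answer uses below), dropping the first weight from the mixture's parameters and running the rest through full should reproduce the original parameter vector:

print(mixture.getParameter())                 # all parameters, first weight included
print(full(mixture.getParameter()[1:]))       # should match the line above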

The next function computes the negative log-likelihood of a given list of parameters (excluding the first weight). For the sake of numerical stability, it divides this value by the number of observations.

We will minimize this function in order to find the Maximum Likelihood Estimate.

def minus_log_pdf(params):
    """
    - log-likelihood of a list of parameters excepting the first weight
    divided by the number of observations
    """
    aux_mixture = ot.Mixture(mixture)

    full_params = full(params)

    try:
        aux_mixture.setParameter(full_params)
    except TypeError:
        # case where the proposed parameters are invalid:
        # return a huge value
        return [ot.SpecFunc.LogMaxScalar]
    res = - aux_mixture.computeLogPDF(s).computeMean()
    return res

To use OpenTURNS optimization facilities, we need to turn this function into a PythonFunction object.

OT_minus_log_pdf = ot.PythonFunction(mixture.getParameter().getSize()-1, 1, minus_log_pdf)

Cobyla is usually good at likelihood optimization.

problem = ot.OptimizationProblem(OT_minus_log_pdf)
algo = ot.Cobyla(problem)

In order to decrease chances of Cobyla being stuck on a local minimum, we are going to use MultiStart. We pick a starting set of parameters and randomly change the weights. The following helper function makes it easy:

def random_weights(params, nb):
    """
    List of nb Points representing mixture parameters with randomly varying weights.
    """
    aux_mixture = ot.Mixture(mixture)
    full_params = full(params)
    aux_mixture.setParameter(full_params)
    list_params = []
    for num in range(nb):
        dirichlet = ot.Dirichlet([1.0] * aux_mixture.getDistributionCollection().getSize()).getRealization()
        dirichlet.add(1.0 - sum(dirichlet))
        aux_mixture.setWeights(dirichlet)
        list_params.append(aux_mixture.getParameter()[1:])
    return list_params

We pick 10 starting points and increase the maximum number of log-likelihood evaluations from the default 100 to 10000.

init = mixture.getParameter()[1:]
starting_points = random_weights(init, 10)
algo_multistart = ot.MultiStart(algo, starting_points)
algo_multistart.setMaximumEvaluationNumber(10000)

Let's run the solver and retrieve the result.

algo_multistart.run()
result = algo_multistart.getResult()

All that remains is to set the mixture's parameters to the optimal value. We must not forget to add the first weight back!

optimal_parameters = result.getOptimalPoint()
mixture.setParameter(full(optimal_parameters))
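
To inspect the result, one can print the updated mixture and the average log-likelihood it reaches on the sample (this check is my own addition, not part of the original answer):

print(mixture)
print("average log-likelihood:", mixture.computeLogPDF(s).computeMean()[0])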
– josephmure

Below is an alternative. The first step creates a new GaussianMixture class, derived from PythonDistribution. The key point is to implement the computeLogPDF method and the setParameter/getParameter methods. Notice that this parametrization of the mixture has only one weight w.

class GaussianMixture(ot.PythonDistribution):
    def __init__(self, mu1 = -5.0, sigma1 = 1.0, \
                 mu2 = 5.0, sigma2 = 1.0, \
                 w = 0.5):
        super(GaussianMixture, self).__init__(1)
        if w < 0.0 or w > 1.0:
            raise ValueError('The weight is not in [0, 1]. w=%s.' % (w))
        self.mu1 = mu1
        self.sigma1 = sigma1
        self.mu2 = mu2
        self.sigma2 = sigma2
        self.w = w
        collDist = [ot.Normal(mu1, sigma1), ot.Normal(mu2, sigma2)]
        weight = [w, 1.0 - w]
        self.distribution = ot.Mixture(collDist, weight)

    def computeCDF(self, x):
        p = self.distribution.computeCDF(x)
        return p

    def computePDF(self, x):
        p = self.distribution.computePDF(x)
        return p

    def computeQuantile(self, prob, tail = False):
        quantile = self.distribution.computeQuantile(prob, tail)
        return quantile

    def getSample(self, size):
        X = self.distribution.getSample(size)
        return X

    def getParameter(self):
        parameter = ot.Point([self.mu1, self.sigma1, \
                              self.mu2, self.sigma2, \
                              self.w])
        return parameter

    def setParameter(self, parameter):
        [mu1, sigma1, mu2, sigma2, w] = parameter
        self.__init__(mu1, sigma1, mu2, sigma2, w)
        return parameter

    def computeLogPDF(self, sample):
        logpdf = self.distribution.computeLogPDF(sample)
        return logpdf

In order to create the distribution, we use the Distribution class:

gm = ot.Distribution(GaussianMixture())
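
As a quick check (my own addition), the wrapped object behaves like any other OpenTURNS distribution:

print(gm.computePDF([0.0]))   # mixture density at x = 0
print(gm.getSample(5))        # five realizations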

Estimating the parameters of this distribution is straightforward with MaximumLikelihoodFactory. However, we must set bounds, because the sigmas cannot be negative and w must lie in (0, 1).

factory = ot.MaximumLikelihoodFactory(gm)
lowerBound = [0.0, 1.e-6, 0.0, 1.e-6, 0.01]
upperBound = [0.0, 0.0,   0.0, 0.0,   0.99]
finiteLowerBound = [False, True, False, True, True]
finiteUpperBound = [False, False, False, False, True]
bounds = ot.Interval(lowerBound, upperBound, finiteLowerBound, finiteUpperBound)
factory.setOptimizationBounds(bounds)

Then we configure the optimization solver.

solver = factory.getOptimizationAlgorithm()
startingPoint = [-4.0, 1.0, 7.0, 1.5, 0.3]
solver.setStartingPoint(startingPoint)
factory.setOptimizationAlgorithm(solver)

Estimating the parameters is done with the build method, where sample is the data sample to fit (for example, the sample generated in the question).

distribution = factory.build(sample)
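
For an end-to-end check (my own sketch, not part of the original answer), the data can be drawn from the wrapped distribution itself and the estimate inspected:

sample = gm.getSample(1000)          # data drawn from the default GaussianMixture
distribution = factory.build(sample)
print(distribution.getParameter())   # compare with the defaults (-5, 1, 5, 1, 0.5)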

There are two limitations with this implementation.

  • First, it is not as fast as it should be, because of a limitation of the PythonDistribution (see https://github.com/openturns/openturns/issues/1391).
  • Estimating the parameters may be difficult, because the likelihood may have local optima in which the default algorithm of MaximumLikelihoodFactory can get stuck. This kind of task is generally done with the EM algorithm.
– Michael Baudin