I'm trying to fit a sum of Gaussians using scikit-learn, because scikit-learn's GaussianMixture seems much more robust than curve_fit.

Problem: it doesn't do a great job of fitting a truncated part of even a single Gaussian peak:

from sklearn import mixture
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import numpy as np

clf = mixture.GaussianMixture(n_components=1, covariance_type='full')
data = np.random.randn(10000)
data = [[x] for x in data]  # GaussianMixture expects 2-D input
clf.fit(data)
data = [item for sublist in data for item in sublist]
rangeMin = int(np.floor(np.min(data)))
rangeMax = int(np.ceil(np.max(data)))
h = plt.hist(data, range=(rangeMin, rangeMax), normed=True)
plt.plot(np.linspace(rangeMin, rangeMax),
         mlab.normpdf(np.linspace(rangeMin, rangeMax),
                      clf.means_, np.sqrt(clf.covariances_[0]))[0])

gives:

[figure: histogram of the full sample with a well-matched Gaussian fit]

Now changing `data = [[x] for x in data]` to `data = [[x] for x in data if x < 0]` in order to truncate the distribution returns:

[figure: histogram of the truncated sample with a visibly poor Gaussian fit]

Any ideas how to get the truncation fitted properly?

Note: the distribution isn't necessarily truncated in the middle; anywhere between 50% and 100% of the full distribution may be left.

I would also be happy if anyone can point me to alternative packages. I've only tried curve_fit but couldn't get it to do anything useful as soon as more than two peaks are involved.

lhcgeneva
  • What did you expect? One half of a Gaussian is not a Gaussian, so no fit of a single simple Gaussian can produce good results. As you are trying Gaussian mixtures, you may use **mixtures**. But for a mixture of **multiple Gaussians**, you need more than `n_components=1`. That's the whole idea there. – sascha Jan 30 '17 at 00:18
  • Not sure that's true. If you do the fit for a single gaussian using curve_fit it works quite nicely. Only that this doesn't work robustly for more than one peak. Also, `n_components=1` for simplicity... – lhcgeneva Jan 30 '17 at 09:27
  • What exactly causes you trouble with the `curve_fit` approach? Have you tried other minimisation methods from `scipy.optimize`? It is quite a powerful toolbox and, I think, should not be dismissed easily. – Vlas Sokolov Jan 31 '17 at 10:26
  • My data are quite noisy and judging by eye GMM usually finds good results - except when part of the data are truncated. I think the strength of GMM is finding good seeds for the fit, which is hard to implement for curve_fit. When using curve_fit half the time the fit doesn't even converge. Haven't tried any other optimizations routines on this problem, but in my experience they are all quite dependent on the initial values. – lhcgeneva Jan 31 '17 at 10:51
  • @lhcgeneva not exactly. While `curve_fit`, and similar gradient-descent-ish methods, do indeed depend on the initial guesses, there is still a number of routines dedicated to a global minimum search. For instance, you can try looking into differential evolution (`scipy.optimize.differential_evolution`), which, from my experience, can give reasonable recovery of Gaussian mixtures (a sketch follows below these comments). – Vlas Sokolov Jan 31 '17 at 11:29
  • Another way to recover components of the multimodal normal distribution may be to employ Bayesian inference - it should be possible to constrain the parameters of individual Gaussian components with, e.g. `pymc` or `emcee` Python packages. – Vlas Sokolov Jan 31 '17 at 11:38
  • Any progress on this? – O.rka Aug 07 '17 at 19:24
  • Nope, unfortunately nothing so far that would robustly fit several peaks. – lhcgeneva Aug 09 '17 at 12:52
  • The scipy stats module has a specific statistical distribution for truncated Gaussian data named truncnorm; it fit your example truncated data as expected when I tried it (a second sketch follows below these comments). The scipy documentation for truncnorm is at: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.truncnorm.html – James Phillips Jan 30 '17 at 09:49
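
For reference, a minimal sketch of the differential-evolution suggestion above. The two-peak test data, the sum-of-Gaussians parameterisation, and the search bounds are illustrative assumptions rather than anything from the original posts; the point is that only a search box per parameter is needed, no initial guesses:

import numpy as np
from scipy.optimize import differential_evolution

np.random.seed(42)
# hypothetical two-peak sample, binned into a curve to fit
sample = np.concatenate([np.random.normal(-2.0, 0.5, 7000),
                         np.random.normal(1.0, 1.0, 3000)])
y, edges = np.histogram(sample, bins=60, density=True)
x = 0.5 * (edges[:-1] + edges[1:])

def gauss_sum(x, params):
    # params = [amp1, mu1, sig1, amp2, mu2, sig2, ...]
    total = np.zeros_like(x)
    for amp, mu, sig in params.reshape(-1, 3):
        total += amp * np.exp(-0.5 * ((x - mu) / sig) ** 2)
    return total

def sse(params):
    # sum of squared residuals between the model and the binned data
    return np.sum((gauss_sum(x, params) - y) ** 2)

# one (amplitude, mean, sigma) box per component; no starting values needed
bounds = [(0, 2), (-5, 5), (0.05, 3)] * 2
result = differential_evolution(sse, bounds, seed=42)
print(result.x.reshape(-1, 3))  # one (amp, mu, sigma) row per component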
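
And a minimal sketch of the truncnorm suggestion, assuming the clip point is known (0 here, as in the question): the untruncated mean and width are recovered by maximising the truncated likelihood directly. The `nll` helper and the Nelder-Mead choice are illustrative:

import numpy as np
from scipy.stats import truncnorm
from scipy.optimize import minimize

np.random.seed(42)
clip = 0.0
full = np.random.randn(10000)
data = full[full < clip]          # truncated sample, as in the question

def nll(params):
    # negative log-likelihood of a Gaussian truncated from above at `clip`
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    b = (clip - mu) / sigma       # truncnorm takes standardised bounds
    return -np.sum(truncnorm.logpdf(data, -np.inf, b, loc=mu, scale=sigma))

res = minimize(nll, x0=[data.mean(), data.std()], method='Nelder-Mead')
print(res.x)                      # roughly [0, 1], the untruncated mu and sigma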

2 Answers


A bit brutish but simple solution would be to split the curve in two halves (`data = [[x] for x in data if x < 0]`), mirror the left part (`data.append([-data[d][0]])`), and then do a regular Gaussian fit.

import numpy as np
from sklearn import mixture
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab

np.random.seed(seed=42)
n = 10000

clf = mixture.GaussianMixture(n_components=1, covariance_type='full')

# split the data at zero and mirror the left half
data = np.random.randn(n)
data = [[x] for x in data if x < 0]
n = len(data)
for d in range(n):
    data.append([-data[d][0]])

clf.fit(data)
data = [item for sublist in data for item in sublist]
rangeMin = int(np.floor(np.min(data)))
rangeMax = int(np.ceil(np.max(data)))
h = plt.hist(data[0:n], bins=20, range=(rangeMin, rangeMax), normed=True)  # histogram of the original (left) half only
plt.plot(np.linspace(rangeMin, rangeMax),
         mlab.normpdf(np.linspace(rangeMin, rangeMax),
                      clf.means_, np.sqrt(clf.covariances_[0]))[0] * 2)  # pdf doubled: the histogram shows only half the symmetric fit

plt.show()

[figure: histogram of the left half with the mirrored Gaussian fit overlaid]

Maximilian Peters
  • This looks great, but what I should've mentioned is that it's not necessarily half of the distribution; it can be an arbitrary truncation (usually leaving a bit more than half of the distribution to fit). Sorry about this, I amended the question. – lhcgeneva Jan 30 '17 at 15:38

@lhcgeneva, the problem is that once the data don't include the maximum of the curve, more and more possible Gaussians can fit:

[figure: black points represent the data, red points the fitted Gaussians]

In the figure, the black points are the data to fit and the red points the fitted results, obtained using A Simple Algorithm for Fitting a Gaussian Function.
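
The referenced algorithm fits a Gaussian by a least-squares parabola fit to the logarithm of the data, which is why it still returns an answer when the sampled range misses the peak. A minimal sketch in the spirit of that paper (the iteration count, the weighting, and the test data are illustrative assumptions):

import numpy as np

def fit_gaussian_log_poly(x, y, n_iter=5):
    # Fit A*exp(-(x - mu)**2 / (2*sigma**2)) by iteratively reweighted
    # least squares on ln(y) = a + b*x + c*x**2.  polyfit squares its
    # weights, so w = y gives the y**2 weighting that damps the noise
    # amplification of the log transform at small y.
    keep = y > 0                            # log only defined for positive data
    x, y = x[keep], y[keep]
    w = y.copy()
    for _ in range(n_iter):
        c, b, a = np.polyfit(x, np.log(y), 2, w=w)
        w = np.exp(a + b * x + c * x ** 2)  # reweight with the current model
    sigma = np.sqrt(-1.0 / (2.0 * c))
    mu = -b / (2.0 * c)
    amp = np.exp(a - b ** 2 / (4.0 * c))
    return amp, mu, sigma

# illustrative check on a domain that excludes the peak maximum
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 0.0, 80)
y = 2.0 * np.exp(-0.5 * (x - 0.5) ** 2) + rng.normal(0.0, 0.02, x.size)
print(fit_gaussian_log_poly(x, y))          # roughly (2.0, 0.5, 1.0)

The caveat from this answer still applies: the heavier the truncation, the more poorly constrained the parameters become, hence the family of red curves in the figure.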

Ed Breen