
I measured the fluorescence intensity of thousands of particles and made a histogram, which showed two adjacent Gaussian curves. How can I use Python or one of its packages to separate them into two Gaussian curves and make two new plots?

[histogram of fluorescence intensities showing two overlapping peaks]

Thank you.

Steve Xu

1 Answer

Basically, you need to infer the parameters of your Gaussian mixture. I will generate a similar dataset for illustration.

Generating mixtures with known parameters

from itertools import starmap

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import mlab
sns.set(color_codes=True)
# inline plots in jupyter notebook
%matplotlib inline


# generate synthetic data from a mixture of two Gaussians with equal weights
# the solution below readily generalises to more components 
nsamples = 10000
means = [30, 120]
sds = [10, 50]
weights = [0.5, 0.5]
draws = np.random.multinomial(nsamples, weights)
samples = np.concatenate(
    list(starmap(np.random.normal, zip(means, sds, draws)))
)

Plot the distribution

sns.histplot(samples, kde=True)  # sns.distplot is deprecated in recent seaborn versions

Infer parameters

from sklearn.mixture import GaussianMixture

mixture = GaussianMixture(n_components=2).fit(samples.reshape(-1, 1))
means_hat = mixture.means_.flatten()
weights_hat = mixture.weights_.flatten()
sds_hat = np.sqrt(mixture.covariances_).flatten()

print(mixture.converged_)
print(means_hat)
print(sds_hat)
print(weights_hat)

We get:

True
[ 122.57524745   29.97741112]
[ 48.18013893  10.44561398]
[ 0.48559771  0.51440229]
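Since the original question asks to separate the data into two groups, note that the fitted model can also assign each sample to its most likely component via `predict` (or give soft posterior probabilities via `predict_proba`). A minimal sketch on synthetic data like the above (component order is arbitrary after fitting):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# synthetic stand-in for the data generated above
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(30, 10, 5000),
                          rng.normal(120, 50, 5000)])

mixture = GaussianMixture(n_components=2, random_state=0).fit(samples.reshape(-1, 1))

# hard assignment: each sample goes to its most probable component
labels = mixture.predict(samples.reshape(-1, 1))
groups = [samples[labels == k] for k in range(2)]

# soft assignment: per-sample posterior probability of each component
posteriors = mixture.predict_proba(samples.reshape(-1, 1))
```

Each array in `groups` can then be plotted as its own histogram; keep in mind that samples in the overlap region are simply assigned to whichever component is more probable.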

You can tweak GaussianMixture's hyper-parameters to improve fit, but this looks fine enough. Now we can plot each component (I'm only plotting the first one):

from scipy.stats import norm  # mlab.normpdf has been removed from matplotlib

mu1_h, sd1_h = means_hat[0], sds_hat[0]
x_axis = np.linspace(mu1_h - 3 * sd1_h, mu1_h + 3 * sd1_h, 1000)
plt.plot(x_axis, norm.pdf(x_axis, mu1_h, sd1_h))
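To overlay both components on one figure, here is a sketch; the numeric values are placeholders standing in for the estimates above, and `scipy.stats.norm.pdf` is used since `mlab.normpdf` was removed from matplotlib. Scaling each density by its weight makes the two curves sum to the mixture density:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from scipy.stats import norm

# placeholder estimates (substitute your means_hat, sds_hat, weights_hat)
means_hat = [122.6, 30.0]
sds_hat = [48.2, 10.4]
weights_hat = [0.49, 0.51]

x_axis = np.linspace(-50, 300, 1000)
mixture_pdf = np.zeros_like(x_axis)
for mu, sd, w in zip(means_hat, sds_hat, weights_hat):
    component = w * norm.pdf(x_axis, mu, sd)  # weight-scaled component density
    mixture_pdf += component
    plt.plot(x_axis, component, label=f"N({mu:.0f}, {sd:.0f}^2), w={w}")
plt.plot(x_axis, mixture_pdf, "k--", label="mixture")
plt.legend()
plt.savefig("components.png")
```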

[plot of the first fitted Gaussian component]

P.S.

On a side note: it seems like you are dealing with constrained data, and your observations are pretty close to the left constraint (zero). While Gaussians might approximate your data well enough, you should tread carefully, because Gaussians assume unconstrained geometry.
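If the positivity constraint matters, one common workaround is to fit the mixture in log space instead. A hedged sketch (the gamma draws below are just a stand-in for positive, right-skewed intensity data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# strictly positive, right-skewed stand-in for intensity measurements
samples = np.concatenate([rng.gamma(9.0, 3.0, 5000),
                          rng.gamma(16.0, 8.0, 5000)])

# the positivity constraint disappears after a log transform
log_mixture = GaussianMixture(n_components=2, random_state=0).fit(
    np.log(samples).reshape(-1, 1)
)

# exp of a log-space mean is the corresponding median in data space
medians = np.sort(np.exp(log_mixture.means_.flatten()))
```

This is equivalent to fitting a mixture of log-normals, which may or may not suit fluorescence data better; it is only worth the extra step if the Gaussian fit visibly leaks past zero.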

Eli Korvigo
  • Thank you so much! – Steve Xu Jul 13 '18 at 14:28
  • @Eli Is there also a way to find the probability that the proportion of data does indeed follow a normal distribution? – Riley Dec 28 '18 at 10:18
  • @Riley it depends on your definition of "proportion". Can you elaborate further? Normally you would use the Shapiro-Wilk test to estimate the probability of observing random empirical deviation from a univariate Gaussian. It's quite a bit more complicated with mixtures and multivariate Gaussians. – Eli Korvigo Dec 28 '18 at 11:09
  • @EliKorvigo, suppose you construct 2 F-distributions and mix them together like you did in the beginning. Then you can apply this same method and get a Gaussian Mixture, but the distributions themselves are not normal. My question is: is there a way to see if the normal distributions/outcomes are appropriate for the 'relevant' parts of the distributions? I know you can test for normality using Shapiro, but then we need the 'cutoff' point of the distributions and test them separately, so I wondered if this package can provide an easier and/or shorter way, something like a p-value of sorts. – Riley Dec 28 '18 at 12:21
  • @Riley well, that falls under the "quite a bit more complicated" category. I'd say your best bet is to fit a Gaussian mixture model and an F-distribution mixture model and perform a likelihood ratio test if you want a p-value. – Eli Korvigo Dec 30 '18 at 09:50
  • @Riley just to clarify, the package I've used here doesn't provide this functionality. If this is something you are struggling with, I guess you should ask a question on Cross Validated. – Eli Korvigo Dec 30 '18 at 22:54
  • Quite late to the party, but for the 2021 reader: the function `normpdf` has been removed from `mlab`. One could use `norm.pdf` from `scipy.stats` instead. – eidal Jul 10 '21 at 19:26