I have the following plot:

Overlapping normal distributions

I would like to estimate the means and standard deviations of what appear to be two overlapping normal distributions. This is complicated by the fact that the data is based on hour of the day, so it is also circular: the right-hand tail wraps around into the left end.

How do I handle this?

  • Interesting question, it's better suited to stats.stackexchange.com since it is more conceptual. That said, a conventional approach is to fit a so-called mixture distribution to your data. The problem is a little off the beaten path since the data are circular. I seem to recall the obvious generalization of the Gaussian density to a circle is the so-called von Mises distribution. On the face of it, it seems like fitting a mixture of two von Mises bumps should have no special difficulties. Good luck and have fun. – Robert Dodier Oct 29 '21 at 20:04
  • Thanks @RobertDodier. It looks like an interesting problem. If I can confirm that the data has two means and standard deviations, I can investigate how other data might explain the differences between the two distributions. – Mark Solinski Oct 29 '21 at 23:08

1 Answer

I'd like to thank Robert Dodier and Adrian Keister for getting me started, and Emily Grace Ripka for her GitHub project: Peak fitting Jupyter notebook

I was able to approximate the two overlapping distributions with von Mises distributions, then fit them by choosing the mean and kappa of each to minimize the error. Kappa is the concentration parameter of a von Mises distribution: it plays the role of an inverse variance, so a larger kappa means a narrower peak.
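Since kappa is not a standard deviation itself, it can help to translate it into one. A minimal sketch, assuming the circular standard deviation sqrt(-2 ln(I1(kappa)/I0(kappa))) is an acceptable measure of spread here; the helper names bessel_i, circular_std and radians_to_hours are illustrative and not part of the original answer, and the kappa values in the loop are made up:

```python
import math

def bessel_i(order, x, terms=40):
    """Modified Bessel function of the first kind via its power series.

    Intended for moderate x (roughly x < 10); larger x needs more terms.
    """
    return sum(
        (x / 2.0) ** (2 * k + order) / (math.factorial(k) * math.factorial(k + order))
        for k in range(terms)
    )

def circular_std(kappa):
    """Circular standard deviation (in radians) of a von Mises distribution."""
    r = bessel_i(1, kappa) / bessel_i(0, kappa)  # mean resultant length
    return math.sqrt(-2.0 * math.log(r))

def radians_to_hours(angle):
    """Convert an angular width to hours on a 24-hour circle."""
    return angle * 24.0 / (2.0 * math.pi)

# Illustrative kappa values only, not fitted results.
for kappa in (0.5, 2.0, 8.0):
    sigma = circular_std(kappa)
    print(f"kappa={kappa}: {sigma:.3f} rad, {radians_to_hours(sigma):.2f} h")
```

For large kappa this approaches 1/sqrt(kappa), which is the sense in which kappa behaves like an inverse variance.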

I did this with SciPy, using the scipy.stats.vonmises distribution and the scipy.optimize.curve_fit function.

I created the following two helper functions:

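from scipy.stats import vonmises

# Sum of two scaled von Mises pdfs; each amp scales a pdf up to the count data.
def two_von_mises(x, amp1, cen1, kappa1, amp2, cen2, kappa2):
    return (amp1 * vonmises.pdf(x - cen1, kappa1) +
            amp2 * vonmises.pdf(x - cen2, kappa2))

# A single scaled von Mises pdf.
def one_von_mises(x, amp, cen, kappa):
    return amp * vonmises.pdf(x - cen, kappa)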
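Written out, the model that two_von_mises evaluates looks like the following, assuming vonmises.pdf(x, kappa) is the standard von Mises density centered at 0. The symbols A_i, mu_i and kappa_i are shorthand for the amp, cen and kappa arguments, and I_0 is the modified Bessel function of the first kind of order 0:

```latex
f(x) = A_1 \, \frac{e^{\kappa_1 \cos(x - \mu_1)}}{2\pi I_0(\kappa_1)}
     + A_2 \, \frac{e^{\kappa_2 \cos(x - \mu_2)}}{2\pi I_0(\kappa_2)}
```

Each density integrates to 1 over the circle, so the amplitudes are what scale the curve up to raw counts.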

I needed to map the time of day onto an angle in the range -pi <= angle < pi, like so:

hourly_df['Angle'] = ((2 * np.pi * hourly_df['HourOfDay']) / 24) - np.pi

I was then able to use the curve_fit function of the scipy.optimize module like so:

# p0 order: amp1, cen1, kappa1, amp2, cen2, kappa2 (cen values are read as radians, like x)
popt, pcov = curve_fit(two_von_mises, hourly_df['Angle'], hourly_df['Count'], p0=[1, 11, 1, 1, 18, 1])
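One thing to watch: because the x values passed to curve_fit are angles, the center entries of p0 are interpreted as radians too. The von Mises density is periodic, so a guess of 11 or 18 still lands somewhere on the circle, but not at 11:00 or 18:00. A minimal sketch of expressing hour-based guesses in the same units as the data, assuming the peaks are expected near those two hours (hour_to_angle and p0_radians are illustrative names, not from the original code):

```python
import math

def hour_to_angle(hour):
    """Map an hour of day in [0, 24) to an angle in [-pi, pi)."""
    return (2.0 * math.pi * hour) / 24.0 - math.pi

# Hypothetical starting guesses: peaks near 11:00 and 18:00.
guess_hours = (11.0, 18.0)
guess_angles = [hour_to_angle(h) for h in guess_hours]

# Same parameter order as two_von_mises: amp1, cen1, kappa1, amp2, cen2, kappa2.
p0_radians = [1.0, guess_angles[0], 1.0, 1.0, guess_angles[1], 1.0]
print(p0_radians)
```

A list like p0_radians could be passed as p0 in place of the hour-valued guesses; whether that changes the fitted result depends on the data.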

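From this I got the parameter estimates for the two distributions (the popt variable above), in the order amp1, cen1, kappa1, amp2, cen2, kappa2. The cen values are angles in radians and are not reduced to the -pi to pi range: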

array([1.66877995e+04, 2.03310292e+01, 2.03941267e+00, 3.61717300e+04,
       2.46426705e+01, 1.32666704e+00])
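The two fitted centers (about 20.33 and 24.64) are angles that have gone several times around the circle, so they need to be wrapped and mapped back to read them as times of day. A minimal sketch, assuming the same hour-to-angle mapping used for the Angle column; angle_to_hour is an illustrative helper, not part of the original code:

```python
import math

def angle_to_hour(angle):
    """Invert angle = 2*pi*hour/24 - pi, wrapping the result into [0, 24)."""
    wrapped = (angle + math.pi) % (2.0 * math.pi)  # shift into [0, 2*pi)
    return wrapped * 24.0 / (2.0 * math.pi)

# Fitted centers copied from popt above (cen1 and cen2, in radians).
cen1, cen2 = 2.03310292e+01, 2.46426705e+01

for name, cen in (("cen1", cen1), ("cen2", cen2)):
    print(f"{name}: {angle_to_hour(cen):.2f} h")
```

With these values the centers come out at roughly 17.7 and 10.1 hours, which is the form in which they can be compared against the plot.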

Plotting this we see:

Data with superimposed von Mises pdf graphed

The next steps will be to see if we can determine what distribution a query belongs to based on categorical data collected for each query, but that is another story...

Thanks!