5

I have data that follow a Gaussian distribution. However, the data is truly Gaussian only for a range of values [xa,xb] so I want to fit a truncated normal distribution using scipy.stats.truncnorm while using the fact that I know the range [xa,xb]. My goal is to find loc and scale.

I don't understand how to fix xa and xb in the fit. The shape parameters are 'a' and 'b', but those depend on loc and scale, which are my unknowns. Moreover, it doesn't seem to be possible to put an initial guess on 'a' and 'b' (they can only be frozen with fa and fb?). When I do:

par = truncnorm.fit(r, a=a_guess, b=b_guess, scale= scale_guess, loc = loc_guess)

I get

Unknown arguments: {'a': 0.0, 'b': 2.4444444444444446}.

Also, the fits I get are very unstable. Here's a example:

from scipy.stats import truncnorm
import matplotlib.pyplot as plt

xa, xb = 30,250 
loc, loc_guess = 50, 30
scale, scale_guess = 75, 90
a,b = (xa-loc)/scale, (xb-loc)/scale

fig, ax = plt.subplots(1, 1)
x = np.linspace(xa,xb,10000)    
ax.plot(x, truncnorm.pdf(x, a, b, loc=loc, scale=scale),
        'r-', lw=5, alpha=0.6, label='truncnorm pdf')

r = truncnorm.rvs(a, b, loc=loc, scale=scale, size=10000)
par = truncnorm.fit(r, scale= scale_guess, loc = loc_guess)
ax.plot(x, truncnorm.pdf(x, *par),
        'b-', lw=1, alpha=0.6, label='truncnorm fit')
ax.hist(r, density=True, histtype='stepfilled', alpha=0.3)
plt.legend()
plt.show()

1st example 2nd example

I also often have this warning:

/home/elie/anaconda2/envs/py36/lib/python3.6/site-packages/scipy/stats/_continuous_distns.py:5823: RuntimeWarning: divide by zero encountered in log self._logdelta = np.log(self._delta)

E. B
  • 51
  • 1
  • 5

1 Answers1

5

As you have discovered, the problem is that the parameters that you want to keep fixed, xa and xb, are not native parameters of truncnorm. truncnorm has the shape parameters a and b, which determine the shape by setting the x-interval for the standard normal distribution. This shape is then shifted and scaled by the loc and scale parameters. The relation is

xa = a*scale + loc
xb = b*scale + loc

To fix xa and xb, you can use one of the SciPy minimizers that accepts equality constraints. Here I'll use scipy.optimize.fmin_slsqp. (You could instead use the "omnibus" function scipy.optmize.minimize, which includes the SLSQP solver as one of its options.)

Here's a script that demonstrate how to use fmin_slsqp for this problem. The function func is the objective function to be minimized. It is just a wrapper for truncnorm.nnlf, the negative log-likelihood function. The function constraint returns an array containing two values. These values are 0 when the constraint is satisfied.

import numpy as np
from scipy.stats import truncnorm
from scipy.optimize import fmin_slsqp

import matplotlib.pyplot as plt


def func(p, r, xa, xb):
    return truncnorm.nnlf(p, r)


def constraint(p, r, xa, xb):
    a, b, loc, scale = p
    return np.array([a*scale + loc - xa, b*scale + loc - xb])


xa, xb = 30, 250 
loc = 50
scale = 75

a = (xa - loc)/scale
b = (xb - loc)/scale

# Generate some data to work with.
r = truncnorm.rvs(a, b, loc=loc, scale=scale, size=10000)

loc_guess = 30
scale_guess = 90
a_guess = (xa - loc_guess)/scale_guess
b_guess = (xb - loc_guess)/scale_guess
p0 = [a_guess, b_guess, loc_guess, scale_guess]

par = fmin_slsqp(func, p0, f_eqcons=constraint, args=(r, xa, xb),
                 iprint=False, iter=1000)

xmin = 0
xmax = 300
x = np.linspace(xmin, xmax, 1000)

fig, ax = plt.subplots(1, 1)
ax.plot(x, truncnorm.pdf(x, a, b, loc=loc, scale=scale),
        'r-', lw=3, alpha=0.4, label='truncnorm pdf')
ax.plot(x, truncnorm.pdf(x, *par),
        'k--', lw=1, alpha=1.0, label='truncnorm fit')
ax.hist(r, bins=15, density=True, histtype='stepfilled', alpha=0.3)
ax.legend(shadow=True)
plt.xlim(xmin, xmax)
plt.grid(True)

plt.show()

Here's the plot that it generates. The sample data is random, so the plot will be different on each run.

plot

Note: occasionally a random data set is generated for which fmin_slsqp fails with an "invalid value encountered" during a calculation. I haven't investigated this further, but you might run into this with your data.

Warren Weckesser
  • 110,654
  • 19
  • 194
  • 214
  • Thank you! It works in more than half of my datasets. For the other half, I have the 'invalid value encountered' error unfortunately. – E. B Nov 08 '18 at 19:22
  • +1 Rather helpful, although I have also encountered convergence problems. In the end I settled for using `par = truncnorm.fit(r, loc=loc_guess, scale=scale_guess)`, since keeping the borders of the interval fixed is less crucial in my case. – Roger Vadim Jul 08 '21 at 08:54