subclassing of scipy.stats.rv_continuous

Question

I have 3 questions regarding subclassing of scipy.stats.rv_continuous. My goal is to code a statistical mixture model of a truncated normal distribution, a truncated exponential distribution and 2 uniform distributions.

1) Why is drawing random variates via mm_model.rvs(size = 1000) so extremely slow? I read something about performance issues in the documentary, but I was really surprised.

2) After drawing random variates via mm_model.rvs(size = 1000) I get this IntegrationWarning?

IntegrationWarning: The maximum number of subdivisions (50) has been achieved. If increasing the limit yields no improvement it is advised to analyze the integrand in order to determine the difficulties. If the position of a local difficulty can be determined (singularity, discontinuity) one will probably gain from splitting up the interval and calling the integrator on the subranges. Perhaps a special-purpose integrator should be used. warnings.warn(msg, IntegrationWarning)

3) I read in the documentary that I can transmit parameters to the pdf via the "shape" parameter. I tried to adjust my pdf and set the shape parameter but it did not work. Could someone explain it?

Thanks for any help.

def trunc_norm(z,low_bound,up_bound,mu,sigma):
    a = (low_bound - mu) / sigma
    b = (up_bound - mu) / sigma
    return stats.truncnorm.pdf(z,a,b,loc=mu,scale=sigma)



def trunc_exp(z,up_bound,lam):
    return stats.truncexpon.pdf(z,b=up_bound*lam,scale=1/lam)



def uniform(z,a,b):
    return stats.uniform.pdf(z,loc=a,scale=(b-a))


class Measure_mixture_model(stats.rv_continuous):

    def _pdf(self,z):

        z_true = 8
        z_min = 0
        z_max = 10
        p_hit = 0.7
        p_short = 0.1   
        p_max = 0.1
        p_rand = 0.1
        sig_hit = 1
        lambda_short = 0.5

        sum_p = p_hit + p_short + p_max + p_rand

        if sum_p < 0.99 or 1.01 < sum_p:
            misc.eprint("apriori probabilities p_hit, p_short, p_max, p_rand have to be 1!")
            return None

        pdf_hit = trunc_norm(z,z_min,z_max,z_true,sig_hit)
        pdf_short = trunc_exp(z,z_true,lambda_short)
        pdf_max = uniform(z,0.99*z_max,z_max)
        pdf_rand = uniform(z,z_min,z_max)

        pdf = p_hit * pdf_hit + p_short * pdf_short + p_max * pdf_max + p_rand * pdf_rand

        return pdf




#mm_model = Measure_mixture_model(shapes='z_true,z_min,z_max,p_hit,p_short,p_max,p_rand,sig_hit,lambda_short')
mm_model = Measure_mixture_model()


z = np.linspace(-1,11,1000)

samples = mm_model.pdf(z)
plt.plot(z,samples)
plt.show()


rand_samples = mm_model.rvs(size = 1000)


bins = np.linspace(-1, 11, 100)
plt.hist(rand_samples,bins)
plt.title("measure mixture model")
plt.xlabel("z: measurement")
plt.ylabel("p: relative frequency")
plt.show()

Paul Panzer · Accepted Answer · 2017-03-02T20:51:37.550

(1) and (2) are probably related. You are asking scipy to generate random samples based on only the density you provide.

I don't really know what scipy does, but I suspect it integrates the density ("pdf") to get the probability function ("cdf") and then inverts it to map uniform samples to your distribution. This is numerically expensive and as you experienced error prone.

To speed things up you can help scipy by implementing _rvs directly. Just draw a uniform to decide which sub-model of your mixture to select and then invoke the rvs of the selected sub-model. And similar for other functions you may require.

Here are some tips on how to implement a vectorised rvs:

To batch-select sub-models. Since your number of sub-models is small np.digitize should be good enough for this. If possible use rv_frozen instances for sub-models; they are very convenient, but I seem to remember that you can't pass all the optional parameters to them, so you may have to handle those separately.

self._bins = np.cumsum([p_hit, p_short, p_max])
self._bins /= self._bins[-1] + p_rand

submodel = np.digitize(uniform.rvs(size=size), self._bins)

result = np.empty(size)
for j, frozen in enumerate((frz_trunc_norm, frz_trunc_exp, frz_unif_1, frz_unif_2)):
    inds = np.where(submodel == j)
    result[inds] = frozen.rvs(size=inds.shape)
return result

Re (3) Here is what the scipy docs have to say.

A note on shapes: subclasses need not specify them explicitly. In this case, shapes will be automatically deduced from the signatures of the overridden methods (pdf, cdf etc). If, for some reason, you prefer to avoid relying on introspection, you can specify shapes explicitly as an argument to the instance constructor.

So the normal way would be to just put some arguments in your methods.

Re: why `rvs()` is slow if you only define `_pdf()`, yes, that's precisely correct. — Robert Kern, Mar 02 '17 at 23:47
Thanks for the answer. Drawing first from a uniform will speed up the code sure a lot, that's a great idea. I will implement and compare it. — Tom, Mar 04 '17 at 09:20

subclassing of scipy.stats.rv_continuous

1 Answers1

Linked