5

I have fit a series of SciPy continuous distributions for a Monte Carlo simulation and need to draw a large number of samples from them. However, I would like to be able to take correlated samples, such that the ith sample takes, e.g., the 90th percentile from each of the distributions.

In doing this, I've found a quirk in SciPy performance:

import numpy as np
import scipy as sp
import scipy.stats  # ensure sp.stats is available

# very fast way to take many uncorrelated samples of length n
for shape, loc, scale in distro_props:
    sp.stats.norm.rvs(*shape, loc=loc, scale=scale, size=n)

# verrrrryyyyy slow way to take correlated samples of length n
correlate = np.random.uniform(size=n)
for shape, loc, scale in distro_props:
    sp.stats.norm.ppf(correlate, *shape, loc=loc, scale=scale)

Most of the results I found about this claim that the slowness of these SciPy distributions is from the type-checking etc. wrappers. However, when I profiled the code, the vast bulk of the time is spent in the underlying math function [_continuous_distns.py:179(_norm_pdf)]. Furthermore, it scales with n, implying that it's looping through every element internally.

The SciPy docs on rv_continuous almost seem to suggest that subclasses should override ppf for performance, but it seems bizarre that I would have to monkeypatch SciPy to speed up its ppf. I would just compute the normal's ppf from the closed-form formula, but I also use the lognormal and skewed normal, which are more of a pain to implement.

So, what is the best way in Python to compute a fast ppf for normal, lognormal, and skewed normal distributions? Or more broadly, to take correlated samples from several such distributions?

NealJMD
  • If I understand correctly, the second option corresponds to numbers in the right half of the distribution (since uniform is in [0, 1]). I'm surprised the first option isn't implemented as `ppf(uniform(-1, 1, size=n), loc=loc, scale=scale)` – Mad Physicist Feb 04 '20 at 20:18
  • Are you restricted to scipy? Maybe a probabilistic programming language has a faster implementation? – OriolAbril Feb 07 '20 at 02:45

2 Answers

3

If you need just the normal ppf, it is indeed puzzling that it is so slow, but you can use scipy.special.erfinv instead:

import numpy as np
from scipy import special, stats
from timeit import timeit

x = np.random.uniform(0, 1, 100)
np.allclose(special.erfinv(2*x - 1) * np.sqrt(2), stats.norm().ppf(x))
# True
timeit(lambda: stats.norm().ppf(x), number=1000)
# 0.7717257660115138
timeit(lambda: special.erfinv(2*x - 1) * np.sqrt(2), number=1000)
# 0.015020604943856597

EDIT:

The lognormal and triangular are also straightforward:

c = np.random.uniform()

# lognormal: exponentiate the standard normal ppf
np.allclose(np.exp(c * special.erfinv(2*x - 1) * np.sqrt(2)), stats.lognorm(c).ppf(x))
# True

# triangular with mode c on [0, 1]
np.allclose(((1 - np.sqrt(1 - (x - c)/((x > c) - c))) * ((x > c) - c)) + c, stats.triang(c).ppf(x))
# True

The skew normal I'm not familiar enough with, unfortunately.
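For the loc/scale parametrization used in the question, these closed forms should extend directly, since SciPy builds every distribution as X = loc + scale*Z from a standardized Z. A minimal sketch of that, with arbitrary example parameters (not values from the question):

import numpy as np
from scipy import special, stats

u = np.random.uniform(size=1000)
s, loc, scale = 0.8, 2.0, 3.0  # arbitrary example parameters

# normal with loc and scale
np.allclose(loc + scale * np.sqrt(2) * special.erfinv(2*u - 1),
            stats.norm(loc=loc, scale=scale).ppf(u))
# expected: True

# lognormal: SciPy's lognorm(s, loc, scale) is loc + scale * exp(s * Z)
np.allclose(loc + scale * np.exp(s * np.sqrt(2) * special.erfinv(2*u - 1)),
            stats.lognorm(s, loc=loc, scale=scale).ppf(u))
# expected: True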

Paul Panzer
  • Thanks! This is a good answer for normal only, but I was hoping that I could avoid re-implementing all the distributions like this that I use. I've also got lognormal, skewed normal, and triangular. Looks like that may be the only option, but it still seems like a problem that would be well solved in the library – NealJMD Feb 06 '20 at 23:52
  • @NealJMD See update, not all you need, unfortunately, but almost. – Paul Panzer Feb 07 '20 at 02:31
0

Ultimately, this issue was caused by my use of the skew-normal distribution. The ppf of the skew normal does not have a closed-form analytic definition, so to compute it SciPy falls back to rv_continuous's generic numeric approximation, which iteratively evaluates the cdf to zero in on the ppf value. The skew-normal pdf is the product of the normal pdf and the normal cdf, so this numeric approximation calls the normal's pdf and cdf many, many times. That is why, when I profiled the code, it looked like the normal distribution was the problem rather than the skew normal. The other answer to this question achieves its time savings by skipping the type checking, but that doesn't change how the runtime grows with n, only the small-n runtimes.
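For illustration, here is a minimal timing sketch of that gap (the shape value 4 and the sample size are arbitrary, and the exact numbers will depend on the SciPy version):

import numpy as np
from scipy import stats
from timeit import timeit

u = np.random.uniform(size=1000)

# closed-form ppf: fast
print(timeit(lambda: stats.norm.ppf(u), number=1))

# no closed-form ppf: rv_continuous falls back to numeric root-finding
# against the cdf, element by element, so this is far slower and grows with n
print(timeit(lambda: stats.skewnorm.ppf(u, 4), number=1))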

To solve this problem, I replaced the skew-normal distribution with the Johnson SU distribution. It has two more free parameters than a normal distribution, so it can fit different kinds of skew and kurtosis effectively. It is supported on all real numbers, and it has a closed-form ppf definition with a fast implementation in SciPy. Below you can see example Johnson SU distributions I've been fitting from the 10th, 50th, and 90th percentiles.

Example distributions
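As a sketch of what the final sampling step looks like (the Johnson SU parameters below are invented for illustration, not my actual fitted values):

import numpy as np
from scipy import stats

n = 100000

# hypothetical fitted (a, b, loc, scale) tuples, one per distribution
fitted_params = [
    (0.5, 1.2, 10.0, 3.0),
    (-0.3, 0.9, 25.0, 5.0),
]

# one shared uniform draw, so the ith sample sits at the same percentile
# of every distribution
u = np.random.uniform(size=n)
samples = [stats.johnsonsu.ppf(u, a, b, loc=loc, scale=scale)
           for a, b, loc, scale in fitted_params]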

NealJMD