Normal distributed sub-sampling from a numpy array in python

Question

I have a numpy array whose values are distributed in the following manner

From this array I need to get a random sub-sample which is normally distributed.

I need to get rid of the values from the array which are above the red line in the picture. i.e. I need to get rid of some occurences of certain values from the array so that my distribution gets smoothened when the abrupt peaks are removed.

And my array's distribution should become like this:

Can this be achieved in python, without manually looking for entries corresponding to the peaks and remove some occurences of them ? Can this be done in a simpler way ?

Increase your bin size? Your distribution is always going to look spiky if your bin sizes are small comapred to your variance. — Daniel F, Dec 08 '17 at 07:08
Also, pruning your sample to fit a distribution is terrible practice in just about any discipline. — Daniel F, Dec 08 '17 at 07:17

Paul Panzer · Answer 1 · 2017-12-08T07:21:18.853

The following kind of works, it is rather aggressive, though:

It works by ordering the samples, transforming to uniform and then trying to select a regular griddish subsample. If you feel it is too aggressive you could increase ns which is essentially the number of samples kept.

Also, please note that it requires the knowledge of the true distribution. In case of normal distribution you should be fine with using sample mean and unbiased variance estimate (the one with n-1).

Code (without plotting):

import scipy.stats as ss
import numpy as np

a = ss.norm.rvs(size=1000)
b = ss.uniform.rvs(size=1000)<0.4
a[b] += 0.1*np.sin(10*a[b])

def smooth(a, gran=25):
    o = np.argsort(a)
    s = ss.norm.cdf(a[o])
    ns = int(gran / np.max(s[gran:] - s[:-gran]))
    grid, dp = np.linspace(0, 1, ns, endpoint=False, retstep=True)
    grid += dp/2
    idx = np.searchsorted(s, grid)
    c = np.flatnonzero(idx[1:] <= idx[:-1])
    while c.size > 0:
        idx[c+1] = idx[c] + 1
        c = np.flatnonzero(idx[1:] <= idx[:-1])
    idx = idx[:np.searchsorted(idx, len(a))]
    return o[idx]


ap = a[smooth(a)]
c, b = np.histogram(a, 40)
cp, _ = np.histogram(ap, b)

Normal distributed sub-sampling from a numpy array in python

1 Answers1