-1

I have a numpy array whose values are distributed in the following manner

enter image description here

From this array I need to get a random sub-sample which is normally distributed.

enter image description here

I need to get rid of the values from the array which are above the red line in the picture. i.e. I need to get rid of some occurences of certain values from the array so that my distribution gets smoothened when the abrupt peaks are removed.

And my array's distribution should become like this: enter image description here

Can this be achieved in python, without manually looking for entries corresponding to the peaks and remove some occurences of them ? Can this be done in a simpler way ?

Sampath Kumar
  • 165
  • 2
  • 10
  • Increase your bin size? Your distribution is always going to look spiky if your bin sizes are small comapred to your variance. – Daniel F Dec 08 '17 at 07:08
  • Also, pruning your sample to fit a distribution is terrible practice in just about any discipline. – Daniel F Dec 08 '17 at 07:17

1 Answers1

0

The following kind of works, it is rather aggressive, though: enter image description here

It works by ordering the samples, transforming to uniform and then trying to select a regular griddish subsample. If you feel it is too aggressive you could increase ns which is essentially the number of samples kept.

Also, please note that it requires the knowledge of the true distribution. In case of normal distribution you should be fine with using sample mean and unbiased variance estimate (the one with n-1).

Code (without plotting):

import scipy.stats as ss
import numpy as np

a = ss.norm.rvs(size=1000)
b = ss.uniform.rvs(size=1000)<0.4
a[b] += 0.1*np.sin(10*a[b])

def smooth(a, gran=25):
    o = np.argsort(a)
    s = ss.norm.cdf(a[o])
    ns = int(gran / np.max(s[gran:] - s[:-gran]))
    grid, dp = np.linspace(0, 1, ns, endpoint=False, retstep=True)
    grid += dp/2
    idx = np.searchsorted(s, grid)
    c = np.flatnonzero(idx[1:] <= idx[:-1])
    while c.size > 0:
        idx[c+1] = idx[c] + 1
        c = np.flatnonzero(idx[1:] <= idx[:-1])
    idx = idx[:np.searchsorted(idx, len(a))]
    return o[idx]


ap = a[smooth(a)]
c, b = np.histogram(a, 40)
cp, _ = np.histogram(ap, b)
Paul Panzer
  • 51,835
  • 3
  • 54
  • 99