
I have a set of points along X and Y. I want to create bins over small ranges of X, calculate percentiles for each bin, and then fit a polynomial regression through all the bins to get a continuous percentile curve. The problem is at the edges: the edge bins contain fewer points, and because of this the percentile values there are distorted.

The following image and code show how everything is done. The percentile values, as you can see, are calculated with a maximum defined at 99.8 and a minimum defined at 4.0:

(image: scatter of data with upper/lower percentile points and polynomial fits)

import numpy as np
import matplotlib.pyplot as plt

###############################
degree = 8
step = 0.05
numPercUp = 99.8
numPercDown = 4.0

###############################
fig = plt.figure(figsize=(8, 6))

dataX = np.random.uniform(low=0.14, high=2.06, size=(1000))
dataY = np.random.uniform(low=50, high=550, size=(1000))
plt.scatter(dataX, dataY, c='b', s=5, marker="+", label="data")

xMin = np.min(dataX)
xMax = np.max(dataX)
print('xMin:', xMin)
print('xMax:', xMax)

xMin = (int(xMin / step)+1) * step
xMax = (int(xMax / step)+1) * step
print('xMin:', xMin)
print('xMax:', xMax)

bins = np.arange(xMin, xMax, step)
inds = np.digitize(dataX, bins) # http://stackoverflow.com/questions/2275924/how-to-get-data-in-a-histogram-bin
print('bins:', bins, bins[0], bins[-1], len(bins))
print('inds:', np.min(inds), np.max(inds), np.sum(inds == 0))


# Percentile coordinates
percX    = np.arange(xMin, xMax+step, step) - step/2    # All bin X position centered on the bin
percUp   = np.zeros(len(bins)+1)
percDown = np.zeros(len(bins)+1)

for i in range(len(bins)+1):
    dataBin = dataY[inds == i]
    percUp[i]   = np.percentile(dataBin, numPercUp)
    percDown[i] = np.percentile(dataBin, numPercDown)

print('percX:', percX)
print('percUp:', percUp)
plt.plot(percX, percUp, color='green', linestyle='-', linewidth=2, marker="o", markerfacecolor='red', markersize=5, label="Up perc.")
plt.plot(percX, percDown, color='green', linestyle='-', linewidth=2, marker="o", markerfacecolor='red', markersize=5, label="Down perc.")


# Polynomial Regression
z = np.polyfit(percX, percUp, degree)
f = np.poly1d(z)
x_new = np.linspace(0.1, 2.1, 50)
y_new = f(x_new)
plt.plot(x_new, y_new, 'r--')

z = np.polyfit(percX, percDown, degree)
f = np.poly1d(z)
x_new = np.linspace(0.1, 2.1, 50)
y_new = f(x_new)
plt.plot(x_new, y_new, 'r--')


# Frame specification
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Perc. Min/Max')
plt.grid()
plt.legend()
plt.axis([0, 2.2, 0, 600])
plt.xticks(np.arange(0, 2.21, 0.1))
plt.yticks(np.arange(0, 600+1, 50))
plt.show()

I see three possibilities, but none of them fully convinces me:

  1. Use bigger bins instead of 0.05, but this reduces accuracy.
  2. Reduce the degree of the polynomial fit, but on the real data (see final image) this also reduces accuracy and the fit no longer follows what is expected.
  3. Delete the edge values before fitting the polynomial regression (maybe this is the most convincing option for me).
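Option 3 can be sketched as a small helper that drops a few bins from each edge before fitting. This is only a sketch: the `trim` width and the toy data are placeholders, not the real `percX`/`percUp` arrays.

```python
import numpy as np

def fit_without_edges(x, y, degree, trim=1):
    """Polynomial fit after dropping `trim` points from each edge."""
    x, y = np.asarray(x), np.asarray(y)
    return np.poly1d(np.polyfit(x[trim:-trim], y[trim:-trim], degree))

# Toy check: a clean parabola whose edge bins are artificially distorted,
# mimicking percentiles computed from under-populated edge bins.
x = np.linspace(0, 2, 41)
y = x ** 2
y[0] += 5.0    # distorted first bin
y[-1] -= 5.0   # distorted last bin
f = fit_without_edges(x, y, degree=2)
```

Since the distorted edge points are excluded, the fit recovers the underlying curve; on real data one would still extrapolate the polynomial across the trimmed range.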

The following image is a real dataset example. As you can see, a low-degree polynomial fit cannot be used, and making the bins bigger won't make things much better on the right side (high X).

(image: real dataset with percentile curves)

What would be a good solution? Maybe forget about a continuous fit and instead apply a Gaussian filter, then use nearest-neighbour interpolation on X to define a smooth regression line?
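The `scipy.interpolate.UnivariateSpline` alternative might look like the sketch below. The fake percentile data and the smoothing factor `s` are illustrative assumptions; `s` would need tuning on real data.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.RandomState(0)

# Fake bin centres and noisy upper-percentile values (placeholders for
# the real percX / percUp arrays).
percX = np.linspace(0.1, 2.0, 39)
percUp = 500 - 50 * (percX - 1) ** 2 + rng.normal(0, 5, percX.size)

# s bounds the sum of squared residuals; a usual starting point is
# s ~ m * sigma**2, where sigma is the per-bin noise level.
spline = UnivariateSpline(percX, percUp, s=percX.size * 5.0 ** 2)
x_new = np.linspace(percX[0], percX[-1], 200)
y_new = spline(x_new)
```

Increasing `s` gives a smoother curve that is less sensitive to a few distorted edge bins than a high-degree polynomial.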

  • what do you mean by "percentile values are distorted"—the percentiles are what they are! Maybe doing something other than a polynomial fit would be better? Do you want to do the extrapolation? – Sam Mason Oct 21 '15 at 09:45
  • Sam Mason, as you can see in the last real image, the highest X values cannot have such a low upper percentile. The last percUp value is around 325 when it should be between 400-450; that's why I said it is distorted. Percentiles are what they are, yes, but the last ones are not correct due to the lack of points in the edge bins, so maybe a good extrapolation could be a solution. That's the reason for proposing to delete the edge-bin values and estimate them with a regression (which may not be the best solution, of course). – iblasi Oct 21 '15 at 10:09
  • By extrapolation, I mean do you want to predict `percUp` for values greater than two? I think I'd transform the X values first; something to stretch out the larger values and thus give those values equal importance in the polynomial. – Sam Mason Oct 21 '15 at 11:49
  • No, I am not looking to predict values greater than 2. I just want to follow the trend: I can clearly see that near 2 the upper percentile should be between 400-450, and that's what I am trying to recover. I am wondering whether scipy.interpolate.UnivariateSpline with appropriate smoothing could be a solution. – iblasi Oct 21 '15 at 12:14

1 Answer


Transform your Xs so you give more importance to the values you care about; for example, taking -log(2.1-X) will give you a basically linear response near 0 and exponential increase nearer 2. Using this function to determine the bins will give better estimates of the values near two.
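The stretching effect can be checked directly; here the offset 2.1 (just past the data maximum) follows the formula above:

```python
import numpy as np

X = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
Xi = -np.log(2.1 - X)   # near-linear for small X, blows up approaching 2.1

# Equal steps in X map to ever-larger steps in the transformed space,
# so equal-width bins there correspond to narrower X bins near the edge.
gaps = np.diff(Xi)
```

That narrowing is what concentrates bins (and therefore resolution) where the data changes fastest.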

Let's start by generating some dummy data:

import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(0, 2, 50000)
Y = np.random.gamma(4, 0.1 + (X**2) * (2 - X) / (0.01 + (2 - X)))
plt.plot(X, Y, '.')
plt.margins(0.04)

generated data

define a function to transform the Xs and its inverse:

def xfrm(X): return -np.log(2.05-np.array(X))
def ivrt(Y): return 2.05-np.exp(-np.array(Y))

We can then bin the transformed values and compute the percentiles per bin:

Xi = xfrm(X)
bins = np.linspace(np.min(Xi),np.max(Xi)+1e-5,201)
ii = np.digitize(Xi,bins)
pcts = np.array([np.percentile(Y[ii==i],[4,95]) for i in range(1,len(bins))])

and generate some plots to make sure it's behaving as expected:

fig,axs = plt.subplots(2,figsize=(8,10))

mids = bins[1:] - np.diff(bins)/2
axs[0].plot(X,Y,'.',zorder=1)
axs[0].vlines(ivrt(mids),pcts[:,0],pcts[:,1],lw=1);
axs[0].margins(0.04)
axs[1].plot(Xi,Y,'.',zorder=1)
axs[1].vlines(mids,pcts[:,0],pcts[:,1],lw=1);
axs[1].margins(0.04)

f = np.poly1d(np.polyfit(mids, pcts[:,1], 8))
axs[0].plot(ivrt(mids), f(mids),lw=3)
axs[1].plot(mids, f(mids))

f = np.poly1d(np.polyfit(mids, pcts[:,0], 8))
axs[0].plot(ivrt(mids), f(mids),lw=3)
axs[1].plot(mids, f(mids));

The upper plot is of the original values and the bottom is of the transformed values. The vertical lines show the values used to generate the fits.

fit

I think I may have got a bit carried away with this one, but hope it's interesting!

  • Very interesting, and I understood this from your previous comment, but maybe I did not explain myself correctly, because I cannot use this. In your example the values on the right side of the graphs drop to 0, and looking at the upper graph, the upper percentile should be around 30 for X~2.0 following the trend. The green line, which represents the upper percentile, does not show that in either graph. – iblasi Oct 21 '15 at 14:15