Find ideal sampling points for data sets with arbitrary sampling points

Question

I have measurement data with fairly arbitrary sampling points. For instance, the sampling points of 3 curves might be

[0.1, 0.15, 0.17, 0.18, 0.185, 20, 1000, 15000]
[0.09, 0.151, 0.169, 0.18, 21, 14000]
[0.11, 0.2, 13999, 14001]

(the corresponding y-values are omitted). In order to calculate the mean, I interpolate all curves linearly using scipy's interp1d and find the common support. Finally, I am looking for sensible setpoints at which I evaluate the mean.

np.linspace(min(common_support), max(common_support), num)

will be very inefficient as num would have to be extremely large for sufficient resolution around 0. In this particular case I would need a couple of setpoints around 0.1-0.2 and some at 20, 14000, 15000.
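The interpolation and common-support step described above might be sketched as follows. This is only an illustration under assumptions: it uses `np.interp` as a stand-in for `scipy.interpolate.interp1d` (both interpolate linearly here), and the y-values as well as the names `xs`, `ys`, `lo`, `hi` and `mean_at` are hypothetical, since the real measurement values are not given.

```python
import numpy as np

# Sampling points from the example above.
xs = [
    np.array([0.1, 0.15, 0.17, 0.18, 0.185, 20, 1000, 15000]),
    np.array([0.09, 0.151, 0.169, 0.18, 21, 14000]),
    np.array([0.11, 0.2, 13999, 14001]),
]
# Hypothetical y-values (the real ones are omitted in the question).
ys = [np.log10(x) + i for i, x in enumerate(xs)]

# Common support: the interval covered by every curve.
lo = max(x.min() for x in xs)  # largest first sampling point
hi = min(x.max() for x in xs)  # smallest last sampling point

def mean_at(points):
    """Linearly interpolate each curve at `points` and average across curves."""
    curves = np.vstack([np.interp(points, x, y) for x, y in zip(xs, ys)])
    return curves.mean(axis=0)

# All measured x-values that fall inside the common support.
pooled = np.unique(np.concatenate(xs))
common_support = pooled[(pooled >= lo) & (pooled <= hi)]
mean_values = mean_at(common_support)
```

Since every interpolant is linear between its own sampling points, the averaged curve is linear between consecutive entries of `common_support`, so these pooled points are one candidate set of setpoints.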

I tried to calculate a probability density function of all the sampling points using

# common support is the set of all x-values in the common support of all functions
from scipy import stats

kernel = stats.gaussian_kde(common_support)
class rv(stats.rv_continuous):
    def _rvs(self, *x, **y):
        return kernel.resample(int(self._size))

which doesn't work very well, because my distribution is often not Gaussian at all.

TL;DR: I need x-values at which to evaluate the mean that are distributed similarly to the set of all x-values in the common support of the data.
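One way to get a fixed number of setpoints that follow the empirical distribution of the data, without fitting a density, is to take evenly spaced quantiles of the pooled x-values. This is a swapped-in technique rather than anything from the post itself; `setpoints_from_quantiles` and `num` are hypothetical names, and the list below holds the example's sampling points that lie inside the common support [0.11, 14000].

```python
import numpy as np

def setpoints_from_quantiles(common_support, num):
    """Return `num` setpoints at evenly spaced quantiles of the given x-values."""
    x = np.sort(np.asarray(common_support, dtype=float))
    probs = np.linspace(0.0, 1.0, num)
    # Quantiles of the empirical distribution: dense regions of the
    # data receive proportionally more setpoints.
    return np.quantile(x, probs)

# Pooled sampling points of the three example curves inside the common support.
common_support = [0.11, 0.15, 0.151, 0.169, 0.17, 0.18, 0.185,
                  0.2, 20, 21, 1000, 13999, 14000]
setpoints = setpoints_from_quantiles(common_support, 7)
```

The setpoints always include the smallest and largest x-value, and clusters of sampling points, wherever they occur, attract more setpoints than sparse regions.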

One thing which came to my mind [earlier today](http://stackoverflow.com/q/33625236/2460374): Would `np.logspace` help you? — jkalden, Nov 10 '15 at 13:55
Unfortunately not, as the bulk of sampling points may be at the beginning, in the middle, at the end, or maybe even in between. There might also be multiple bulks — stebu92, Nov 10 '15 at 13:57
Let me see if I understand correctly: there is some unknown function f(x), and you only have a selection of x-values x_i (which you listed in the question), as well as corresponding values f(x_i) (which you didn't), and you would like to use the values you have to approximate the mean of f(x) over some range [a,b]? — David Z, Nov 10 '15 at 14:10
I have a set of measurement results f_i. Each consists of x_i (the sampling points) and corresponding y_i (the measurement values, which are omitted in the question). I now want to calculate mean(f_i). Unfortunately the x_i can be very different (as indicated by the example). Given the range [a,b] and all interpolated functions f_int,i, I now need to find ideal sampling points to calculate that mean. — stebu92, Nov 10 '15 at 14:55

score 0 · Answer 1 · answered Jul 06 '16 at 15:03

You are using linear interpolation. The integral of a piecewise linear function is computed exactly by applying the trapezoidal rule with the sample points being the vertices of the polygonal line, i.e., your data points. The mean value is the integral divided by the length of the range of integration. So, just use

mean = np.trapz(y, [0.1, 0.15, 0.17, 0.18, 0.185, 20, 1000, 15000])/(15000 - 0.1)

where y is the vector of y-values.
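As a minimal sketch of why this is exact, the trapezoidal rule can be written out by hand, which also avoids depending on which NumPy version provides `np.trapz` (NumPy 2.0 and later name it `np.trapezoid`). The function name `mean_piecewise_linear` and the y-values are hypothetical; a straight line is used so the result can be checked analytically.

```python
import numpy as np

def mean_piecewise_linear(x, y):
    """Mean of the linear interpolant through (x, y) over [x[0], x[-1]]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Trapezoidal rule: each segment contributes width * average height.
    integral = np.sum(np.diff(x) * (y[:-1] + y[1:]) / 2.0)
    return integral / (x[-1] - x[0])

x = np.array([0.1, 0.15, 0.17, 0.18, 0.185, 20, 1000, 15000])
y = 2.0 * x + 1.0  # hypothetical measurement values on a straight line
mean = mean_piecewise_linear(x, y)
```

For y = 2x + 1 the mean over [a, b] is (a + b) + 1, so the uneven spacing of the sample points does not affect the result.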
