Mean values depending on binning with respect to second variable

Question

I am working with python / numpy. As input data I have a large number of value pairs (x,y). I basically want to plot <y>(x), i.e., the mean value of y for a certain data bin x. At the moment I use a plain for loop to achieve this, which is terribly slow.

# create example data
x = numpy.random.rand(1000)
y = numpy.random.rand(1000)
# set resolution
xbins = 100
# find x bins
H, xedges, yedges = numpy.histogram2d(x, y, bins=(xbins,xbins) )
# calculate mean and std of y for each x bin
mean = numpy.zeros(xbins)
std = numpy.zeros(xbins)
for i in numpy.arange(xbins):
    mean[i] = numpy.mean(y[ numpy.logical_and( x>=xedges[i], x<xedges[i+1] ) ])
    std[i]  = numpy.std (y[ numpy.logical_and( x>=xedges[i], x<xedges[i+1] ) ])

Is it possible to have a kind of vectorized writing for it?

Jaime · Accepted Answer · 2013-03-18T13:51:36.650

17

You are complicating things unnecessarily. All you need to know is, for every bin in x, what are n, sy and sy2, the number of y values in that x bin, the sum of those y values, and the sum of their squares. You can get those as:

>>> n, _ = np.histogram(x, bins=xbins)
>>> sy, _ = np.histogram(x, bins=xbins, weights=y)
>>> sy2, _ = np.histogram(x, bins=xbins, weights=y*y)

From those:

>>> mean = sy / n
>>> std = np.sqrt(sy2/n - mean*mean)

edited Mar 18 '13 at 13:51

answered Mar 18 '13 at 13:33

Jaime

65,696
17
124
159

Wow - I did not think of interpreting `y` as "weights" to `x`... Great! – Jakob S. Mar 18 '13 at 13:48
2

@JakobS. Nobody does... until seeing it done for the first time! – Jaime Mar 18 '13 at 13:53
@Jaime. You calculate the `std` as `sqrt(x**2-µ**2)`, when it should be `sqrt((x-µ)**2)`, right? – Swift Feb 12 '19 at 16:43
@Swift Take a look at [this](https://en.wikipedia.org/wiki/Algebraic_formula_for_the_variance#In_terms_of_raw_moments), neither of your formulas is correct without adding some summations somewhere, both can be correct if added in the proper places. I'm pretty confident that the code is conceptually correct. – Jaime Feb 14 '19 at 10:32
@Jaime Thanks! I now understood why yours is the same as in Wikipedia and why it is a very nifty method! Thanks a lot! one more question. Your method uses `sy/N` , where some defintions (like matlabs default) would sometimes use `sy/(N-1)`. Which one would I use for which situations? Thanks! – Swift Mar 06 '19 at 13:59
With `n-1` you get an unbiased estimator of the population standard deviation based on your data sample. With `n` you get the actual standard deviation of your data sample. This can be a subtle difference... I'd say that more often than not you want the former, but it really depends on what you are doing exactly. [Here's](https://en.wikipedia.org/wiki/Variance#Sample_variance) some reading material, the math is kind of complicated, but you don't really need to understand all of it to get the main idea. – Jaime Mar 25 '19 at 17:04

score 1 · Answer 2 · answered Mar 18 '13 at 13:50

1

If you can use pandas:

import pandas as pd
xedges = np.linspace(x.min(), x.max(), xbins+1)
xedges[0] -= 0.00001
xedges[-1] += 0.000001
c = pd.cut(x, xedges)
g = pd.groupby(pd.Series(y), c.labels)
mean2 = g.mean()
std2 = g.std(0)

answered Mar 18 '13 at 13:50

HYRY

94,853
25
187
187

Mean values depending on binning with respect to second variable

2 Answers2

Linked