5

I am working with python / numpy. As input data I have a large number of value pairs (x,y). I basically want to plot <y>(x), i.e., the mean value of y for a certain data bin x. At the moment I use a plain for loop to achieve this, which is terribly slow.

# create example data
x = numpy.random.rand(1000)
y = numpy.random.rand(1000)
# set resolution
xbins = 100
# find x bins
H, xedges, yedges = numpy.histogram2d(x, y, bins=(xbins,xbins) )
# calculate mean and std of y for each x bin
mean = numpy.zeros(xbins)
std = numpy.zeros(xbins)
for i in numpy.arange(xbins):
    mean[i] = numpy.mean(y[ numpy.logical_and( x>=xedges[i], x<xedges[i+1] ) ])
    std[i]  = numpy.std (y[ numpy.logical_and( x>=xedges[i], x<xedges[i+1] ) ])

Is it possible to have a kind of vectorized writing for it?

Jakob S.
  • 1,851
  • 2
  • 14
  • 29

2 Answers2

17

You are complicating things unnecessarily. All you need to know is, for every bin in x, what are n, sy and sy2, the number of y values in that x bin, the sum of those y values, and the sum of their squares. You can get those as:

>>> n, _ = np.histogram(x, bins=xbins)
>>> sy, _ = np.histogram(x, bins=xbins, weights=y)
>>> sy2, _ = np.histogram(x, bins=xbins, weights=y*y)

From those:

>>> mean = sy / n
>>> std = np.sqrt(sy2/n - mean*mean)
Jaime
  • 65,696
  • 17
  • 124
  • 159
  • Wow - I did not think of interpreting `y` as "weights" to `x`... Great! – Jakob S. Mar 18 '13 at 13:48
  • 2
    @JakobS. Nobody does... until seeing it done for the first time! – Jaime Mar 18 '13 at 13:53
  • @Jaime. You calculate the `std` as `sqrt(x**2-µ**2)`, when it should be `sqrt((x-µ)**2)`, right? – Swift Feb 12 '19 at 16:43
  • @Swift Take a look at [this](https://en.wikipedia.org/wiki/Algebraic_formula_for_the_variance#In_terms_of_raw_moments), neither of your formulas is correct without adding some summations somewhere, both can be correct if added in the proper places. I'm pretty confident that the code is conceptually correct. – Jaime Feb 14 '19 at 10:32
  • @Jaime Thanks! I now understood why yours is the same as in Wikipedia and why it is a very nifty method! Thanks a lot! one more question. Your method uses `sy/N` , where some defintions (like matlabs default) would sometimes use `sy/(N-1)`. Which one would I use for which situations? Thanks! – Swift Mar 06 '19 at 13:59
  • With `n-1` you get an unbiased estimator of the population standard deviation based on your data sample. With `n` you get the actual standard deviation of your data sample. This can be a subtle difference... I'd say that more often than not you want the former, but it really depends on what you are doing exactly. [Here's](https://en.wikipedia.org/wiki/Variance#Sample_variance) some reading material, the math is kind of complicated, but you don't really need to understand all of it to get the main idea. – Jaime Mar 25 '19 at 17:04
1

If you can use pandas:

import pandas as pd
xedges = np.linspace(x.min(), x.max(), xbins+1)
xedges[0] -= 0.00001
xedges[-1] += 0.000001
c = pd.cut(x, xedges)
g = pd.groupby(pd.Series(y), c.labels)
mean2 = g.mean()
std2 = g.std(0)
HYRY
  • 94,853
  • 25
  • 187
  • 187