How does scipy.stats.binned_statistic map the sequence of bin edges to the data on which the statistic will be computed?

Question

Take the following example from the documentation:

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3576)
windspeed = 8 * rng.random(500)
boatspeed = .3 * windspeed**.5 + .2 * rng.random(500)
bin_means, bin_edges, binnumber = stats.binned_statistic(windspeed,
                boatspeed, statistic='median', bins=[1,2,3,4,5,6,7])

The first value in bin_means (actually the median is calculated in this case) is 0.48067334, which is the 90th value in the array boatspeed.

I'm really confused as to how this method takes the array of bins and maps it onto the value vector (boatspeed in this case). How does the 90th entry belong to a bin starting from "1" and ending at "2"? Could someone please give an intuitive example or explanation?

It's also not clear to me what the windspeed vector is needed for. According to the documentation, this is "a sequence of values to be binned", but the statistic is being calculated on the second vector, boatspeed, which to me means that we are actually binning boatspeed and windspeed doesn't seem to be used/needed.

Cheers!

score 0 · Answer 1 · answered Aug 13 '22 at 01:05

The example in the documentation for scipy.stats.binned_statistic() analyzes the variable Y (boatspeed), which is generated from the random variable X (windspeed) as Y = f(X) plus some random noise. The summary statistic (here the median) of boatspeed is therefore computed conditionally on windspeed. In other words, each entry of bin_means is a conditional median rather than the (unconditional) median of the whole boatspeed array.

You, however, are treating this value unconditionally, i.e. as a position within the whole boatspeed array, since

len(boatspeed[boatspeed < 0.48067334])
> 90

In contrast, binned_statistic() computes the conditional median of boatspeed, given that the corresponding windspeed values fall in the interval [1,2). This can be confirmed by running

np.quantile(boatspeed[(1 <= windspeed) & (windspeed < 2)], 0.5)
> 0.48067334081468044

and observing the same value. More generally, each conditional median corresponds to an entry in bin_means since

all([np.isclose(np.quantile(boatspeed[(binLower <= windspeed) & (windspeed < binUpper)], 0.5), binMean) for binLower, binUpper, binMean in zip(bin_edges[:-1], bin_edges[1:], bin_means)])
> True

Intuitively speaking, we obtain an answer to the question: given that the wind speed was in category 1 (i.e. 1 <= windspeed < 2), what was the median boat speed among those observations?
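To make the idea of "bin on one array, summarize the other" concrete, here is a minimal sketch on a tiny hand-made data set. The arrays x and values and their contents are made up for illustration only; they are not part of the documentation example.

```python
import numpy as np
from scipy import stats

# x decides which bin each observation belongs to;
# values is what the statistic is computed on.
x = np.array([1.2, 1.8, 2.5, 0.3, 1.5])
values = np.array([10.0, 30.0, 50.0, 99.0, 20.0])

result = stats.binned_statistic(x, values, statistic='median', bins=[1, 2, 3])

# Bin 1 covers 1 <= x < 2 and collects the values 10, 30 and 20.
# Bin 2 covers 2 <= x <= 3 and collects only the value 50.
# The observation with x = 0.3 lies below the first edge and is ignored.
print(result.statistic)   # medians per bin: 20 and 50
print(result.binnumber)   # bin index per observation: 1, 1, 2, 0, 1
```

Note that the value 99 never enters any median, because its x value (0.3) lies outside all bins, regardless of how large or small the value itself is.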

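Moreover, binnumber is an array with one entry per data point, and each entry is the index of the bin that the data point falls into. The binning is only concerned with windspeed; values outside the bin range get index 0 (below the first edge) or len(bin_edges) (above the last edge). This can be verified by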

all([max(windspeed[binnumber==i]) < min(windspeed[binnumber==i+1]) for i in range(1,7)])
> True

indicating that the maximum windspeed value in each bin is strictly smaller than the minimum windspeed value in the next bin. This is what we expect from a proper binning of the data.
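The bin membership can also be inspected directly on a small array with values placed below, inside, on, and above the bin edges. The sketch below uses made-up numbers and compares binnumber with np.digitize; np.digitize serves here only as an independent cross-check, not as a statement about how SciPy implements the binning internally.

```python
import numpy as np
from scipy import stats

# Hypothetical wind speeds: one below the first edge, one exactly on the
# last edge (7.0), and one above it.
x = np.array([0.5, 1.0, 1.9, 2.0, 3.5, 7.0, 7.5])
edges = [1, 2, 3, 4, 5, 6, 7]

# statistic='count' ignores the content of the second argument,
# so x is simply passed twice.
counts, bin_edges, binnumber = stats.binned_statistic(
    x, x, statistic='count', bins=edges)

# Half-open intervals [edge_i, edge_i+1) for every bin.
digitized = np.digitize(x, edges)

print(binnumber)   # 0 1 1 2 3 6 7
print(digitized)   # 0 1 1 2 3 7 7
print(counts)      # 2 1 1 0 0 1
```

The two index arrays differ only for x = 7.0: binned_statistic treats the last bin as closed on the right, so a value exactly on the last edge still belongs to bin 6, while indices 0 and 7 mark values outside all bins.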

Additionally, we can reproduce the example above by

np.quantile(boatspeed[binnumber==1], 0.5)
> 0.48067334081468044
