Determining optimal bins to bin the data

Question

I have X, Y data that I would like to bin according to the X values. However, I would like to determine the optimal number of X bins that satisfies a condition based on the resulting bin intervals and the average Y of each bin. For example, if I have

X=[2,3,4,5,6,7,8,9,10]

Y=[120,140,143,124,150,140,180,190,200]

I would like to determine the best number of X bins that satisfies this condition: (average Y of the bin) / (8 * width of the X bin) should be above 20, but as close as possible to 20. The bins should also be integers, e.g., [1,2,..]. I am currently using:

bin_means, bin_edges, binnumber = binned_statistic(X, Y, statistic='mean', bins=bins)

with bins being predefined. However, I would like an algorithm that can determine the optimal bins for me before using this. It is easy to work out by hand for a small dataset, but for hundreds of points it becomes time-consuming.
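To make the condition concrete, here is a minimal sketch of how it could be checked for one candidate bin width. It assumes equal-width bins that start at floor(min(X)) and ignores empty bins; the helper name `bin_ratios` and its `factor` parameter are made up for illustration and do not come from any library.

```python
import numpy as np

def bin_ratios(x, y, width, factor=8):
    """Per-bin value of mean(Y) / (factor * width) for equal-width bins.

    Assumption: bins start at floor(min(x)) and are `width` wide.
    Empty bins are skipped. Illustrative helper, not a library function.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    start = np.floor(x.min())
    idx = ((x - start) // width).astype(int)   # bin index of each point
    counts = np.bincount(idx)                  # points per bin
    sums = np.bincount(idx, weights=y)         # sum of Y per bin
    filled = counts > 0
    return sums[filled] / counts[filled] / (factor * width)

X = [2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [120, 140, 143, 124, 150, 140, 180, 190, 200]

ratios = bin_ratios(X, Y, width=1)
print(ratios)                      # one ratio per non-empty bin
print(bool((ratios > 20).all()))   # does every bin clear the threshold?
```

With the sample data above and a width of 1, the ratios range from 15 to 25, so not every bin is above 20, and wider equal-width bins only lower the ratio further.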

Thank you

See if this helps... https://stats.stackexchange.com/q/798/275865 – RichieV, Jul 27 '20 at 03:26
@RichieV Yes, the average Y of each bin is bin_means. I assume it has to be iterative. Thanks for the link you posted; it seems to be a similar problem, but I'm interested in satisfying this special condition. Also, the answers there seem to be in R, which I'm not familiar with. – Jamal, Jul 27 '20 at 05:14
Hello @RichieV, I honestly did not continue working with your code. I managed to write some fairly long code to do this. The idea is to start with wider bins and iteratively narrow them down until the condition is satisfied. It is long but it does the job; I'm sure there are far more efficient ways. If you're interested, I can send it to you. – Jamal, Aug 27 '20 at 02:04
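The wide-to-narrow idea described in the last comment could look roughly like the sketch below. This is not Jamal's code (which was never posted); it assumes equal-width integer bins starting at floor(min(X)), skips empty bins, and relies on the ratio generally growing as bins get narrower, so the first width that passes is taken as the one closest to the threshold. The name `narrow_until_ok` and its parameters are hypothetical.

```python
from math import floor

def narrow_until_ok(x, y, max_width, target=20.0, factor=8):
    """Try integer bin widths from max_width down to 1.

    Returns the first (widest) width for which every non-empty bin has
    mean(Y) / (factor * width) > target, or None if no width qualifies.
    """
    start = floor(min(x))
    for width in range(max_width, 0, -1):
        sums, counts = {}, {}
        for xi, yi in zip(x, y):
            k = int((xi - start) // width)      # bin index of this point
            sums[k] = sums.get(k, 0.0) + yi
            counts[k] = counts.get(k, 0) + 1
        ratios = [sums[k] / counts[k] / (factor * width) for k in sums]
        if min(ratios) > target:
            return width
    return None

X = [2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [120, 140, 143, 124, 150, 140, 180, 190, 200]
print(narrow_until_ok(X, Y, max_width=8))
```

Returning None makes the "no equal-width solution" case explicit; a variant with unequal bin widths would need a different search.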

RichieV · Answer 1 · 2020-07-27T14:38:23.197

If you NEED to iterate to find the optimal nbins with your minimization function, take a look at numpy.digitize:

https://numpy.org/doc/stable/reference/generated/numpy.digitize.html

And try:

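import numpy as np
import pandas as pd

min_nbins, max_nbins = 2, 10  # input min/max_nbins (placeholder values; max is exclusive)
start = min(X)
stop = max(X)
# Bin index (0..n-1) of every X value, for each candidate number of bins n.
# Only the interior edges are passed so that max(X) falls in the last bin.
cut_dict = {
    n: np.digitize(X, bins=np.linspace(start, stop, num=n+1)[1:-1])
    for n in range(min_nbins, max_nbins)}
Y = pd.Series(Y).rename('Y')
# Mean of the per-bin Y means, for each candidate number of bins
avg = {nbins: Y.groupby(cut).mean().mean() for nbins, cut in cut_dict.items()}
avg = pd.Series(avg, name='mean_ybins').to_frame()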

Then you can find which candidate is closest to 20, or check whether 20 is the right number...
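Picking the candidate "closest to 20" could be sketched as below. Note that `avg` above holds the mean of the bin means, not the ratio from the question, so this assumes you have already computed, for each candidate number of bins, the value you actually want to compare with 20. The function name `closest_above` and the example scores are made up for illustration.

```python
def closest_above(scores, target=20.0):
    """Return the key whose score is above `target` and closest to it.

    `scores` maps a candidate (e.g. a number of bins) to the value being
    compared with the target. Returns None if no score exceeds the target.
    """
    above = {k: v for k, v in scores.items() if v > target}
    if not above:
        return None
    return min(above, key=above.get)

# Made-up scores for illustration only (not computed from the question's data)
example = {2: 12.5, 3: 19.0, 4: 23.1, 5: 27.8}
print(closest_above(example))
```

If the scores live in a pandas Series `s`, `s[s > 20].idxmin()` does the same selection, though it raises an error when nothing is above the target.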

edited Jul 27 '20 at 14:38

answered Jul 27 '20 at 13:13

RichieV


Thanks for sharing the code. I'm trying to run it, but I get an error: avg = {nbins: Y.groupby(cut).mean().mean() for nbins, cut in cut_dict.items} TypeError: 'builtin_function_or_method' object is not iterable – Jamal Jul 27 '20 at 14:30
Right, it should be `.items()` with parentheses – RichieV Jul 27 '20 at 14:39
