Optimal Bucket Size and No. of Buckets

Question

Sorry, this post is not so much about coding as about data structures and algorithms. I have a large amount of data, with each value occurring at a different frequency. Plotted, the frequencies look roughly like a bell curve. I now want to display the data in ranges (buckets) that describe the frequencies as precisely as possible. For example, a single bucket covering the entire range holds the total frequency, but it is not precise and can be refined. (If some of the data is concentrated in a particular frequency zone, we could build a bucket that covers a smaller range of data but holds more closely related frequencies.)
Can anyone suggest an algorithm? I thought of something related to binary search. Any ideas, folks?

amit · Answer 1 · 2012-06-06T11:46:21.697

4

Not sure I am following, but it seems you are looking for k bins such that, for any two bins, the probability of a data point falling in one is identical to the probability of it falling in the other.

From your description, your data seems to be normally distributed, or T-distributed.

One can estimate the mean and standard deviation of the data; let the estimated S.D. be s and the mean be u.

The standard formulas for evaluating the mean and S.D. from the sample are¹:

u = (x1 + x2 + ... + xn) / n (simple average)
s^2 = Sigma((xi - u)^2)/(n-1)
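As a minimal sketch of the two formulas above, assuming the data is available as a plain list of numbers (the function name `sample_mean_and_sd` and the sample values are made up for illustration):

```python
import math

def sample_mean_and_sd(xs):
    """Return (u, s): the sample mean and the sample standard deviation.

    Uses the n-1 denominator for the variance, as in the formula above.
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points")
    u = sum(xs) / n                                  # simple average
    s2 = sum((x - u) ** 2 for x in xs) / (n - 1)     # sample variance
    return u, math.sqrt(s2)

# Illustrative data only, not taken from the question.
u, s = sample_mean_and_sd([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

If the data comes as (value, frequency) pairs instead, each value would have to be weighted by its frequency; that variant is not shown here.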

Given this information, you can estimate the distribution of your data as N(u,s^2), and so create a random variable: X~N(u,s^2)²

Now all that is left is finding a,b,... as follows (assuming 10 buckets; this can obviously be modified as you wish):

P(X<a) = 0.1
P(X<b) = 0.2
P(X<c) = 0.3
...

After finding a,b,c,... you have your bins: (-infinity,a], (a,b], (b,c], ...
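The boundaries a, b, c, ... are the quantiles of the fitted normal distribution. A rough sketch using `statistics.NormalDist` from the Python standard library (3.8+); the function name `equal_probability_edges` and the parameters used in the example call are assumptions for illustration:

```python
from statistics import NormalDist

def equal_probability_edges(u, s, k=10):
    """Return the k-1 interior boundaries that split N(u, s^2)
    into k bins of equal probability 1/k."""
    dist = NormalDist(mu=u, sigma=s)
    # inv_cdf(p) returns the x for which P(X < x) = p
    return [dist.inv_cdf(i / k) for i in range(1, k)]

# Example with a standard normal; in practice pass the estimated u and s.
edges = equal_probability_edges(0.0, 1.0, k=10)
```

The first bin is (-infinity, edges[0]] and the last is (edges[-1], +infinity). Bins near the mean come out narrower than bins in the tails, because the density is highest there.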

(1) Evaluating variance: http://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance
(2) The real distribution for this variable is actually the t-distribution, since the variance is unknown and is estimated from the data. However, for large enough n, the t-distribution converges to the normal distribution.

edited Jun 06 '12 at 11:46

answered Jun 05 '12 at 07:02

amit


Thanks for the idea. I'm clear about bucket size now. Well worth it, but I want the maximum number of useful buckets. If two consecutive data sets have low frequencies, it would be more useful to merge them into one data set with the higher combined frequency. Essentially, I'm saying that the number of buckets is a dynamic variable that has to be chosen according to the data sets and their frequencies. There must be some algorithm for choosing the number of buckets. 10 buckets (say) might look good in one case and not in another. – user1425322 Jun 05 '12 at 08:34
@user1425322: This approach will give you `k` data sets, which are all expected to have the same added frequency. The `k` here is a parameter you need to predefine. – amit Jun 05 '12 at 08:36
All I have is a data set and its frequencies, which look like a normal curve (like you said). But I want some buckets (number unknown) that accommodate all the data sets in the best possible manner. Having a fixed number of buckets would result in a dull representation for different data sets. Any idea or algorithm for this? – user1425322 Jun 05 '12 at 08:43
You say "best possible manner", but best according to what criteria? That's the key information that's missing in the question. – Chris Okasaki Jun 05 '12 at 12:52
It's intuitive, to be frank. For example, the buckets may be very narrow in the densest region and wide, but not too wide, in the low-frequency regions. I just want an estimate of how I can select a varying number of buckets with varying widths, so that the dense regions have narrow buckets and the sparse regions have wide buckets. I'm after some algorithm, not an accurate analysis. – user1425322 Jun 06 '12 at 08:58
@user1425322 I think this answer gives you precisely what you describe. Having recovered the distribution parameters, you will find the boundaries between your buckets for probabilities of, say, 1/6, 2/6, 3/6, ..., 1, so that each bucket holds the same 1/6 of your data. At the top of your distribution you will naturally get a narrow bucket; towards the edges the buckets will be wider. Do you want the top point of the curve to be in the middle of the top bucket? Is that your concern? Or do you want all your buckets to be the same size (what about the edges then)? – Will Ness Jun 06 '12 at 11:50

score -1 · Answer 2 · edited May 24 '17 at 12:32

-1

First count all the indexes, then subtract the repeating values; this will give you the optimal number of buckets, but only at a small scale.

edited May 24 '17 at 12:32

Abdul Malik


answered May 23 '17 at 19:20

Ricky

