I have an imbalanced numeric data set that looks like this:

[Image: plot of the data set]

I need to bin the data into 8 bins. However, if I use bins of equal width, almost all of my data falls into two bins and the bins in the middle stay empty.

Is there a statistical or mathematical method that discretizes data with fine-grained bins where there are many data points, and coarser bins where there are few?

Brian Tompsett - 汤莱恩
  • This is an x -> x function, so why do you need the bucketing? – eliasah Feb 08 '16 at 06:57
  • The plot is just to visualize what my data looks like. I have a vector of numeric values (length=4964, min=1, max=7478, mean=5.045, stdDev=106.6) and I want to discretize them into 8 bins. – Enas Ahmad Feb 08 '16 at 07:42

1 Answer

You can sort the data and bin it according to rank, an approach sometimes also called "equal-depth" binning. So if your data after sorting is

[1, 2, 4, 8, 16, 32]

and you want three bins, you would use

[1, 2] [4, 8] [16, 32]

Finding a good rule for the bin centers and borders is harder. You would probably use the bin means (1.5, 6, 24) as centers, and the midpoints between the largest value of one bin and the smallest value of the next as cell borders: [1:3], [3:12] and [12:32].
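The rank-based binning described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function names `equal_depth_bins` and `centers_and_borders` are made up for this example, and ties are not handled specially.

```python
def equal_depth_bins(values, k):
    """Split values into k bins holding (nearly) equal numbers of points."""
    data = sorted(values)
    n = len(data)
    # Bin i takes the sorted items from index i*n//k up to (i+1)*n//k.
    bins = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    return [b for b in bins if b]  # drop empty bins when k > n


def centers_and_borders(bins):
    """Centers are bin means; inner borders are midpoints between neighbours."""
    centers = [sum(b) / len(b) for b in bins]
    inner = [(left[-1] + right[0]) / 2 for left, right in zip(bins, bins[1:])]
    # The outer borders are simply the smallest and largest observed values.
    borders = [bins[0][0]] + inner + [bins[-1][-1]]
    return centers, borders


bins = equal_depth_bins([1, 2, 4, 8, 16, 32], 3)
centers, borders = centers_and_borders(bins)
print(bins)     # [[1, 2], [4, 8], [16, 32]]
print(centers)  # [1.5, 6.0, 24.0]
print(borders)  # [1, 3.0, 12.0, 32]
```

Note that if the data contains many repeated values, identical values can end up on both sides of a border in this sketch, so you may want to merge or shift such bins.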

The bin counts are no longer interesting on their own, because all bins are expected to hold the same number of points. But if you have more than one variable, a combination of bins may hold fewer or more points than expected, which indicates some dependency among the variables.
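To illustrate the point about several variables, here is a small sketch that bins two variables by rank and compares the joint counts with the count expected if the variables were independent. The data and the helper name `rank_bin` are hypothetical and chosen only for this example.

```python
from collections import Counter


def rank_bin(values, k):
    """Map each value to an equal-depth bin index 0..k-1 based on its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(values)
    return labels


# Hypothetical, perfectly dependent data: y = 2*x + 1.
x = [1, 2, 4, 8, 16, 32, 64, 128]
y = [3, 5, 9, 17, 33, 65, 129, 257]
k = 2

# Count how many points fall into each (x bin, y bin) combination.
counts = Counter(zip(rank_bin(x, k), rank_bin(y, k)))
# With equal-depth bins and independent variables, each cell expects n / k^2.
expected = len(x) / k ** 2

for cell in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(cell, counts[cell], "expected", expected)
```

Here the cells (0, 0) and (1, 1) each hold 4 points and the other two hold none, while 2 points per cell would be expected under independence.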

Has QUIT--Anony-Mousse