I want to bin my data into 10 bins (histograms) using percentile ranges:

bins = [0, 10th-percentile(myData), 20th-percentile(myData), 30th..., 90th-percentile(myData), +inf]

So in order to make a histogram out of my data, I just do:

import numpy as np
myBinnedData = np.histogram(myData, bins=bins)[0]

My problem is that I have several ties in myData and whenever a tie spans two bins or more, np.histogram will just put all the values in the first bin and leave the second one empty.

This is because the bin edges will contain two consecutive equal values (X-percentile(myData) == Y-percentile(myData)).
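
A minimal sketch of the situation, using a made-up array rather than the data from the question. The name `my_data` and its values are assumptions chosen so that one value accounts for 40% of the samples; the edges follow the layout described above (0, the 10th to 90th percentiles, +inf).

```python
import numpy as np

# Hypothetical sample with heavy ties: the value 5 is 40% of the data.
my_data = np.array([1, 2, 3, 5, 5, 5, 5, 8, 9, 10], dtype=float)

# Interior edges at the 10th..90th percentiles, closed off with 0 and +inf.
interior = np.percentile(my_data, np.arange(10, 100, 10))
bins = np.concatenate(([0.0], interior, [np.inf]))

counts, _ = np.histogram(my_data, bins=bins)

# Consecutive equal edges produce zero-width bins, which cannot hold anything.
zero_width = np.diff(bins) == 0
print("edges:           ", bins)
print("counts:          ", counts)
print("zero-width bins: ", np.flatnonzero(zero_width))
```

Because every bin except the last is half-open, a zero-width bin always stays empty and the tied values all end up in a single neighbouring bin.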

How can I account for this?

Ricky Robinson
  • I think I understand what you're getting at. You'll need some basis for breaking the ties. (Otherwise it's not "fair" -- which samples go into the lower and upper bins?) Or, it might help to answer the question, what will you do with the data after it is binned the way you want? – Dan Allan Jun 17 '13 at 19:42
  • I will apply the Chi-Square test afterwards. In principle, if a tie spans from a fraction of one bin to the whole of the next one, I should be able to count the number of values falling in each bin, right? It's just that the tools I'm provided with won't let me because of all the above. But if all the values between, say, the 20th and 30th percentiles are `k`, I know that 10% of all my values are `k`. And this 10% falls into that bin. Is this approach correct? – Ricky Robinson Jun 18 '13 at 08:16
  • Right. I think if you simply sort the array and split it into chunks comprising 10% of the data (or whatever evenly-divisible chunk is close to 10%), you'll have what you want. – Dan Allan Jun 18 '13 at 12:54
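
The sort-and-split idea from the last comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper `equal_count_bins` and the sample array are hypothetical names, and `np.array_split` is used here because it accepts lengths that are not evenly divisible by the number of bins.

```python
import numpy as np

def equal_count_bins(data, n_bins=10):
    """Sort the data and split it into n_bins chunks of (nearly) equal size.

    Returns the chunks and the number of samples in each chunk.
    """
    sorted_data = np.sort(np.asarray(data))
    # array_split gives the first (len % n_bins) chunks one extra element.
    chunks = np.array_split(sorted_data, n_bins)
    counts = np.array([len(chunk) for chunk in chunks])
    return chunks, counts

# Hypothetical data with ties, not the data from the question.
my_data = np.array([1, 2, 3, 5, 5, 5, 5, 8, 9, 10])
chunks, counts = equal_count_bins(my_data, n_bins=10)
print(counts)
```

Note that tied values are assigned to chunks purely by their position in the sorted array, so the tie-break is arbitrary, and the counts are equal (or nearly equal) by construction.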
