
In the typical histogram created with `numpy.histogram` or `matplotlib.pyplot.hist`, the bins have uniform width, or the user supplies their own bin edges. There are many rules of thumb for the "optimal" binning -- e.g., using sqrt(sample size) bins.
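For reference, a minimal sketch of the uniform-width case (the data and the square-root rule here are just placeholders for whatever sample and rule one actually uses):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
data = np.random.normal(size=500)       # placeholder sample

n_bins = int(np.sqrt(data.size))        # square-root rule of thumb

counts, edges = np.histogram(data, bins=n_bins)   # uniform-width bins
plt.hist(data, bins=n_bins)
plt.show()
```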

Sometimes there are bins with zero objects in them -- e.g., at the extremes of the histogram. This is a problem if one wants to look for a correlation -- e.g., to check whether the number of objects per bin increases as the quantity on the x-axis increases. (Imagine a histogram in which nearly every other bin is effectively empty, or one in which only the first and last bins are effectively empty -- both cases visualize the data poorly and make any underlying correlation harder to see.)

In such cases, it might be beneficial to impose a threshold on the binning such that each bin contains at least N objects. Of course, the bin widths will probably no longer be uniform.

Is there an easy way (i.e., a built-in function) to create such a "thresholded" histogram in Python, using NumPy, SciPy, or matplotlib? Or, failing that, to split a sorted array of numbers such that each sub-array contains at least N numbers?
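To make the second part concrete, here is a rough sketch of the kind of greedy splitting I have in mind (the function name and the rule for folding an undersized last bin into its neighbour are my own choices, not anything standard):

```python
import numpy as np

def threshold_bin_edges(values, min_count):
    """Greedy pass over sorted data: close a bin once it holds at least
    `min_count` values; an undersized final bin is merged into the previous
    one. Ties exactly at an edge are not handled specially."""
    values = np.sort(np.asarray(values))
    edges = [values[0]]
    count = 0
    for i, v in enumerate(values):
        count += 1
        if count >= min_count and i < values.size - 1:
            edges.append(0.5 * (v + values[i + 1]))  # edge between this value and the next
            count = 0
    edges.append(values[-1])
    if count < min_count and len(edges) > 2:
        del edges[-2]                                # merge undersized last bin
    return np.array(edges)

np.random.seed(1)
data = np.random.normal(size=200)
edges = threshold_bin_edges(data, min_count=10)
counts, _ = np.histogram(data, bins=edges)           # each bin should now hold >= 10 objects
```

I realize something like `np.array_split(np.sort(data), k)` gives roughly equal-count groups, but that fixes the number of groups rather than enforcing a minimum count per bin.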

Also, is such a binning scheme considered optimal (in that the resulting histogram gives a smoother picture of where the data lie), or sub-optimal (in that the binning is being tuned to a desired outcome, rather than the data simply being shown as-is)?

asked by quantumflash
  • For what it's worth the "optimal" (ish) way to handle this is to use a kernel density estimate (e.g. `scipy.stats.gaussian_kde`) instead of a histogram. That having been said, no, there's nothing built in to `numpy`, `scipy`, or `matplotlib` to generate optimal histogram bins given a dataset and a number of bins. There's more than one way to approach the problem, though... Do you want the bin sizes to be as close to even as possible? Or do you want them to be a function of the density of the data? (Or, of course, something in between...) Just things to think about, at any rate. – Joe Kington Oct 05 '15 at 18:55
  • 1
    Have you looked at Jenks Natural Breaks classifier? There is code for numpy from Cross Validated and lots of discussion on its use in texts and on the web http://stats.stackexchange.com/questions/143974/jenks-natural-breaks-in-python-how-to-find-the-optimum-number-of-breaks –  Oct 05 '15 at 21:03
  • @JoeKington I'm basically creating a relative frequency histogram (a normal histogram, but with the number of objects in each bin divided by the total number of objects in my sample). This is supposed to represent a "detection fraction" within each of my bins. The problem is that the first and last bins contain only a few objects (say 1-2), whereas each of the intermediate bins contains >10 objects, so those first and last bins have a fraction of ~0. What is the scientifically honest thing to do at this point? Either don't plot those bins, or impose a threshold so that each bin has at least N objects? – quantumflash Oct 14 '15 at 14:23
  • Also, I really like kernel density estimation (KDE) compared to histograms, but I can't find any easy way to create a weighted KDE (analogous to a relative frequency histogram). The y-axis ("density") of a KDE is not as easy to interpret as the fractions on the y-axis of a relative frequency histogram. (A rough sketch of what I mean follows this comment.) – quantumflash Oct 14 '15 at 17:32
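Following up on the weighted-KDE point above, a rough sketch of what is meant (note that the `weights` keyword of `scipy.stats.gaussian_kde` only exists in newer SciPy releases, 1.2 and later; the weights here are hypothetical per-object weights):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

np.random.seed(2)
data = np.random.normal(size=300)                 # placeholder sample
weights = np.random.uniform(0.5, 1.5, size=300)   # hypothetical per-object weights

# `weights` requires SciPy >= 1.2; older versions need a hand-rolled weighted KDE
kde = gaussian_kde(data, weights=weights)

x = np.linspace(data.min(), data.max(), 200)
plt.plot(x, kde(x))   # y-axis is a probability density (area under the curve is 1),
                      # not a per-bin fraction as in a relative frequency histogram
plt.show()
```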

0 Answers