7

I am using the following code to digitize an array into 16 bins:

numpy.digitize(array, bins=numpy.histogram(array, bins=16)[1])

I expect that the output is in the range [1, 16], since there are 16 bins. However, one of the values in the returned array is 17. How can this be explained?

sandesh247

4 Answers

8

This is actually documented behaviour of numpy.digitize():

Each index i returned is such that bins[i-1] <= x < bins[i] if bins is monotonically increasing, or bins[i-1] > x >= bins[i] if bins is monotonically decreasing. If values in x are beyond the bounds of bins, 0 or len(bins) is returned as appropriate.

So in your case, 0 and 17 are also valid return values (note that the bin array returned by numpy.histogram() has length 17). The bins returned by numpy.histogram() cover the range array.min() to array.max(). The condition given in the docs shows that array.min() belongs to the first bin, while array.max() lies outside the last bin -- that's why 0 is not in the output, while 17 is.
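A minimal sketch that reproduces this, assuming a made-up sample array (`np.arange(100)` is only an illustration, not the asker's data). The largest value equals the last edge, so it fails the `x < bins[i]` test for the last bin and gets index `len(bins)`:

```python
import numpy as np

# Hypothetical sample data; any 1-D numeric array should behave the same way.
array = np.arange(100, dtype=float)

counts, edges = np.histogram(array, bins=16)
indices = np.digitize(array, bins=edges)

print(len(edges))                     # 17 edges for 16 bins
print(indices.min(), indices.max())   # 1 17
print(np.flatnonzero(indices == 17))  # [99], the position holding array.max()
```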

Sven Marnach
  • Hmm, I do know about the edge case behavior of digitize(). However, since I am using histogram() to create the bins, aren't all values supposed to lie within the bins? – sandesh247 Dec 04 '10 at 23:18
  • As I explained in my answer, `array.min()` is supposed to lie in the first bin because it satisfies the `bins[0] <= array.min() < bins[1]` condition, but `array.max()` does not fulfil `bins[15] <= array.max() < bins[16]`, so it's not in the last bin. – Sven Marnach Dec 05 '10 at 00:53
  • Thanks for your patience. The behavior of the `bins` argument for `numpy.histogram` is different (the last interval is a closed interval), which led to the confusion. – sandesh247 Dec 05 '10 at 05:32
  • @sandesh247 I agree with you that making the max value not belong to the histogram seems silly, to say the least. It just hit me when I was computing entropies and things started failing after np.digitize, as there was always 1 outlier no matter what number of bins or right=True/False setting I tried. – Anatoly Alekseev Aug 25 '23 at 09:11
2

In numpy version 1.8, you have the option to choose whether numpy.digitize treats each interval as closed on the right or on the left. The following example is copied from http://docs.scipy.org/doc/numpy/reference/generated/numpy.digitize.html:

import numpy as np

x = np.array([1.2, 10.0, 12.4, 15.5, 20.])
bins = np.array([0, 5, 10, 15, 20])
np.digitize(x, bins, right=True)
# array([1, 2, 3, 4, 4])
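Applied to edges taken from the data itself, `right=True` seems to move the problem rather than remove it. A small sketch, assuming a made-up sample array: the maximum now lands in bin 16, but the minimum equals the first edge and falls out on the other side with index 0.

```python
import numpy as np

sample = np.arange(100, dtype=float)  # hypothetical sample data
sample_edges = np.histogram_bin_edges(sample, bins=16)

# Intervals are now (edge[i-1], edge[i]], so the first edge itself is excluded.
right_closed = np.digitize(sample, sample_edges, right=True)

print(right_closed.min(), right_closed.max())  # 0 16
print(np.flatnonzero(right_closed == 0))       # [0], the position holding sample.min()
```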

2

numpy.histogram() produces an array of the bin edges, of which there are (number of bins)+1.

Andrew Jaffe
0

OK, I found a recipe to discretize an array with numpy. The problem is that np.histogram_bin_edges (and therefore np.histogram) and np.digitize are not consistent in how they use bin edges: the first two always return one more edge than there are bins, so whichever `right` mode you use in np.digitize, you are always left with one "outlier" category. What one has to do (assuming the edges are in ascending order) is:

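import numpy as np

bin_edges = np.histogram_bin_edges(arr, bins=4)  # or any other source of ascending edges
if bin_edges[0] <= arr.min():
    # Drop the first edge: with right=True, values up to bin_edges[1] get index 0.
    categorized_arr = np.digitize(arr, bins=bin_edges[1:], right=True)
elif bin_edges[-1] >= arr.max():
    # Drop the last edge: with right=False, values at or above bin_edges[-2] get the last index.
    categorized_arr = np.digitize(arr, bins=bin_edges[:-1], right=False)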
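A possible alternative, offered as a sketch rather than as part of the recipe above: digitize against the interior edges only. This assumes every value lies within the edge range (anything outside would be folded silently into the first or last category), and the sample array is made up for illustration. The resulting 0-based categories should then agree with the counts from np.histogram, including the closed last bin.

```python
import numpy as np

arr = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # hypothetical sample data
counts, bin_edges = np.histogram(arr, bins=4)

# Interior edges only: n_bins - 1 edges split the data into n_bins categories.
categories = np.digitize(arr, bins=bin_edges[1:-1])

print(categories)                            # [0 1 2 3 3]
print(np.bincount(categories, minlength=4))  # [1 1 1 2]
print(counts)                                # [1 1 1 2]
```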
Anatoly Alekseev