2

For example, I have a stream of arrays containing numbers ranging from 0.0 to 10.0 inclusive.

I want to assign the numbers in each array to 5 bins of equal length quickly.

By equal length I mean the bin intervals are [0.0, 2.0), [2.0, 4.0), [4.0, 6.0), [6.0, 8.0), [8.0, 10.0].

The problem is that the last interval is closed on the right, unlike the other intervals, so 10.0 must land in the last bin.

Test:

import numpy as np
# Things we know and can pre-calculate
n_bins = 5
minimal = 0.0  
maximal = 10.0
reciprocal_bin_length = n_bins / (maximal - minimal)

# Let's say the stream gives 1001 numbers every time.
data = np.arange(1001)/100

norm_data = (data - minimal) * reciprocal_bin_length
norm_data = norm_data.astype(int)
print(norm_data.max())
print(norm_data.min())

Result:

5
0

The bin index should be 0, 1, 2, 3, or 4, but not 5.

hamster on wheels
  • I think you may be looking for [numpy.digitize](https://docs.scipy.org/doc/numpy/reference/generated/numpy.digitize.html). –  Jul 21 '17 at 15:16
  • The min is 0 and the max is exactly 10 that way. digitize allows bins of uneven length and might be slower; here the bins have equal length. – hamster on wheels Jul 21 '17 at 15:17
  • what about [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html) – jeremycg Jul 21 '17 at 15:24
  • @jeremycg I tried to read the source code of pandas.cut. It seems to adjust the max beforehand, like multiplying the max by 1.001. That introduces a small binning error, but no clipping is needed afterwards. – hamster on wheels Jul 21 '17 at 15:33
  • Your logic is fine, you are just missing one line at the end: `norm_data[norm_data >= n_bins] = n_bins -1`, equivalent to the posted solution. – Imanol Luengo Jul 21 '17 at 15:33

2 Answers

3

A "poor man's solution" could be to take the element-wise minimum of your array norm_data and n_bins - 1:

norm_data = np.minimum(norm_data, n_bins - 1)

So every index of 5 or above is converted to 4. Note that this performs no proper range check: 120.0 will also end up in bin 4.
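If the missing range check matters, the clamp can be paired with a validity mask. The following is only a sketch under assumptions: the helper name `bin_indices` and the choice of -1 as an "out of range" marker are hypothetical and not part of the question.

```python
import numpy as np

# Known ahead of time, as in the question.
n_bins = 5
minimal = 0.0
maximal = 10.0
reciprocal_bin_length = n_bins / (maximal - minimal)

def bin_indices(data):
    """Hypothetical helper: map values in [minimal, maximal] to bins 0..n_bins-1."""
    data = np.asarray(data, dtype=float)
    # Remember which values are inside the closed range before clamping.
    in_range = (data >= minimal) & (data <= maximal)
    idx = ((data - minimal) * reciprocal_bin_length).astype(int)
    # Fold the right endpoint (index n_bins) into the last bin.
    idx = np.minimum(idx, n_bins - 1)
    # Mark out-of-range values with -1 (an assumed sentinel) instead of
    # silently placing them in a valid bin.
    idx[~in_range] = -1
    return idx

sample = np.array([0.0, 1.99, 2.0, 9.99, 10.0, 120.0, -0.5])
print(bin_indices(sample).tolist())  # [0, 0, 1, 4, 4, -1, -1]
```

The mask costs two extra comparisons per element, so it is worth skipping when the stream is guaranteed to stay within [minimal, maximal].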

Willem Van Onsem
0

If a 0.1% error is acceptable, the following is a bit faster. I am not sure whether this is robust to floating-point rounding.

import numpy as np
# Things we know and can pre-calculate
n_bins = 5
minimal = 0.0  
maximal = 10.0
approx = 1.001  # <-- this is new
# Stretch the whole range (not just maximal) so this also works when minimal != 0.
reciprocal_bin_length = n_bins / ((maximal - minimal) * approx)

# Let's say the stream gives 1001 numbers every time.
data = np.arange(1001)/100

# can use numexpr for speed.
norm_data = (data - minimal) * reciprocal_bin_length
norm_data = norm_data.astype(int)
print(norm_data.max())
print(norm_data.min())
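To make the 0.1% error concrete, here is a small sketch comparing the exact scale (plus clamping) with the stretched scale at the nominal bin edges. The variable names `exact_scale`, `stretched_scale`, and `edges` are introduced only for this illustration. Stretching the range by `approx` makes each bin about 2.002 wide, so values sitting exactly on an interior edge fall into the lower bin.

```python
import numpy as np

n_bins = 5
minimal = 0.0
maximal = 10.0
approx = 1.001

exact_scale = n_bins / (maximal - minimal)
stretched_scale = n_bins / ((maximal - minimal) * approx)

# Nominal right edges of the five bins.
edges = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Exact scale, with the right endpoint clamped into the last bin.
exact = np.minimum(((edges - minimal) * exact_scale).astype(int), n_bins - 1)
# Stretched scale, no clamping needed.
stretched = ((edges - minimal) * stretched_scale).astype(int)

print(exact.tolist())      # [1, 2, 3, 4, 4]
print(stretched.tolist())  # [0, 1, 2, 3, 4]
```

Only values in the thin slices just above each interior edge (for example 2.0 up to about 2.002) are affected, which is the trade-off for skipping the clamp.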
hamster on wheels