
Given a matrix of values that represent probabilities, I am trying to write an efficient process that returns the bin a sample value belongs to. For example:

import numpy as np

sample = 0.5
x = np.array([0.1]*10)
np.digitize(sample, np.cumsum(x)) - 1
# returns 4 (np.digitize itself returns 5)

is the result I am looking for. According to timeit, for x arrays with few elements it is more efficient to do it with a plain loop:

cdf = 0
for key, val in enumerate(x):
    cdf += val
    if sample <= cdf:
        print(key)
        break

while for bigger x arrays the NumPy solution is faster (see the benchmark sketch at the end of the question). The questions:

  1. Is there a way to further accelerate it, e.g., a function that combines the steps?
  2. Can we vectorize the process for the case where sample is a list, where each item is associated with its own x array (x would then be 2-D)?

In my application x contains the marginal probabilities; this is why I need to decrement the result of np.digitize.
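
For reference, a timing comparison along these lines might look like the following sketch (the random x, the sizes, and the repeat counts here are arbitrary illustrations, not the original measurements):

# Rough benchmark sketch: pure-Python loop vs. np.digitize for a single sample.
import timeit

setup = """
import numpy as np
np.random.seed(0)
x = np.random.rand({n})
x /= x.sum()          # normalize so x behaves like marginal probabilities
sample = 0.5
"""

loop_stmt = """
cdf = 0
for key, val in enumerate(x):
    cdf += val
    if sample <= cdf:
        break
"""

digitize_stmt = "np.digitize(sample, np.cumsum(x)) - 1"

for n in (10, 10000):
    t_loop = timeit.timeit(loop_stmt, setup=setup.format(n=n), number=1000)
    t_dig = timeit.timeit(digitize_stmt, setup=setup.format(n=n), number=1000)
    print(n, t_loop, t_dig)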

geompalik

1 Answer


You could use some broadcasting magic there -

(x.cumsum(1) > sample[:,None]).argmax(1)-1

Steps involved:

I. Perform cumsum along each row.

II. Use a broadcasted comparison of each cumsum row against its sample value and find the first position where the cumulative sum exceeds the sample; the element just before that position is the bin we are looking for, hence the -1.

Step-by-step run -

In [64]: x
Out[64]: 
array([[ 0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ],
       [ 0.8 ,  0.96,  0.88,  0.36,  0.5 ,  0.68,  0.71],
       [ 0.37,  0.56,  0.5 ,  0.01,  0.77,  0.88,  0.36],
       [ 0.62,  0.08,  0.37,  0.93,  0.65,  0.4 ,  0.79]])

In [65]: sample # one elem per row of x
Out[65]: array([ 0.5,  2.2,  1.9,  2.2])

In [78]: x.cumsum(1)
Out[78]: 
array([[ 0.1 ,  0.2 ,  0.3 ,  0.4 ,  0.5 ,  0.6 ,  0.7 ],
       [ 0.8 ,  1.76,  2.64,  2.99,  3.49,  4.18,  4.89],
       [ 0.37,  0.93,  1.43,  1.45,  2.22,  3.1 ,  3.47],
       [ 0.62,  0.69,  1.06,  1.99,  2.64,  3.04,  3.83]])

In [79]: x.cumsum(1) > sample[:,None]
Out[79]: 
array([[False, False, False, False, False,  True,  True],
       [False, False,  True,  True,  True,  True,  True],
       [False, False, False, False,  True,  True,  True],
       [False, False, False, False,  True,  True,  True]], dtype=bool)

In [80]: (x.cumsum(1) > sample[:,None]).argmax(1)-1
Out[80]: array([4, 1, 3, 3])

# A loopy solution to verify results against
In [81]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[81]: [4, 1, 3, 3]

Boundary cases :

The proposed solution automatically handles cases where a sample value is smaller than the smallest of the cumulative summed values -

In [113]: sample[0] = 0.08  # editing first sample to be lesser than 0.1

In [114]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[114]: [-1, 1, 3, 3]

In [115]: (x.cumsum(1) > sample[:,None]).argmax(1)-1
Out[115]: array([-1,  1,  3,  3])

For cases where a sample value is greater than the largest of the cumulative summed values, we need one extra step -

In [116]: sample[0] = 0.8  # editing first sample to be greater than 0.7

In [121]: mask = (x.cumsum(1) > sample[:,None])

In [122]: idx = mask.argmax(1)-1

In [123]: np.where(mask.any(1),idx,x.shape[1]-1)
Out[123]: array([6, 1, 3, 3])

In [124]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[124]: [6, 1, 3, 3]
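
Putting the pieces together, the vectorized solution with the boundary handling could be collected into a small helper like the sketch below (the function name is mine); for the edited sample above it reproduces Out[123]:

import numpy as np

def bin_indices(x, sample):
    # x: (K, N) rows of marginal probabilities, sample: (K,) sample values
    mask = x.cumsum(1) > sample[:, None]   # broadcasted comparison per row
    idx = mask.argmax(1) - 1               # first True per row, shifted by one
    # rows where no cumulative sum exceeds the sample fall into the last bin
    return np.where(mask.any(1), idx, x.shape[1] - 1)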
Divakar
  • This is great. I hadn't thought of it that way. I will dive into it! – geompalik Jan 27 '17 at 13:43
  • If you have `N` bins and `K` samples, this creates an array of size `N*K`, so you are basically turning a problem that is inherently linearithmic with `O(K * log(N))` into a quadratic one with `O(N*K)`. This pays off for small `K` or `N` because Python and NumPy and all that, but it is very important to be aware of the trade-offs you are making. – Jaime Jan 27 '17 at 13:49
  • Thank you Jaime. You are referring to the memory requirements and the trade-off between speed and memory, right? – geompalik Feb 07 '17 at 07:34
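
A memory-lean variant along the lines of Jaime's comment is essentially the loopy digitize from In [81] wrapped in a function (this sketch and its name are not from the thread); it searches one row at a time, so only one length-N temporary is alive at any moment instead of an N*K mask:

import numpy as np

def bin_indices_lean(x, sample):
    # Loop over rows; np.digitize handles both boundary cases shown above.
    out = np.empty(len(sample), dtype=int)
    for i, (row, s) in enumerate(zip(x, sample)):
        out[i] = np.digitize(s, np.cumsum(row)) - 1
    return out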