
I'm trying to get the equivalent of np.unique, but with an 'axis=1' option.

import numpy as np

a = np.array([[8, 8, 8, 5, 8],
              [8, 2, 0, 8, 8],
              [4, 5, 4, 2, 4],
              [4, 6, 5, 2, 6]])

I'm looking to get the value with the highest count in each row and save it to a 1D vector. Basically "which value is most seen in each row."

Correct answer: [8,8,4,6] in this example.

Right now I'm doing something like:

y = np.zeros(len(a))

for i in range(len(a)):
    u, cnt = np.unique(a[i, :], return_counts=True)
    # pick the value from 'u' that is seen the most
    y[i] = u[np.argmax(cnt)]

This gives the desired result, but the Python-level loop is very slow over thousands of rows. I'm looking for a fully vectorized approach.

I found a post on unique row elements, but it doesn't quite do what I want (either I'm not clever enough to munge it into the desired form, or it's not directly applicable).

Thank you in advance for any help you can provide.

Phil Glau
  • Note that unique cannot be vectorized in the way you want: there may be a different number of unique elements per row, so the return would have to be a ragged array, which is not an option in NumPy. – Jaime Jun 16 '16 at 08:44

2 Answers


One option is to use scipy.stats.mode:

In [36]: from scipy.stats import mode

In [37]: a
Out[37]: 
array([[8, 8, 8, 5, 8],
       [8, 2, 0, 8, 8],
       [4, 5, 4, 2, 4],
       [4, 6, 5, 2, 6]])

In [38]: vals, counts = mode(a, axis=1)

In [39]: vals
Out[39]: 
array([[8],
       [8],
       [4],
       [6]])

In [40]: counts
Out[40]: 
array([[4],
       [3],
       [3],
       [2]])
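
Note that mode returns 2-D column vectors here, one entry per row of a. To get the 1-D vector you asked for, ravel the result:

In [41]: vals.ravel()
Out[41]: array([8, 8, 4, 6])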

However, it is written in Python on top of numpy, and depending on the distribution of values in the input, it might not be any faster than your solution. You can find the implementation in https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py (as I write this, it is here: https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py#L372).

The essential part of the function depends only on numpy, so if it works well enough for you but you don't want the dependency on scipy, you could copy the function into your own project; just be sure to follow the terms of the BSD license that scipy uses.
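
If you do copy it, the core idea is easy to sketch in plain numpy: loop over the candidate values rather than over the rows, so the Python-level loop runs once per distinct value instead of once per row. Here is a minimal sketch of that value-loop idea (the helper name is mine, not scipy's actual code):

import numpy as np

def rowwise_mode(a):
    # loop over distinct values, not rows: for each candidate value,
    # count its occurrences per row and keep it where it beats the best so far
    vals = np.zeros(a.shape[0], dtype=a.dtype)
    best = np.zeros(a.shape[0], dtype=int)
    for v in np.unique(a):
        cnt = (a == v).sum(axis=1)   # occurrences of v in each row
        better = cnt > best          # rows where v beats the current best
        vals[better] = v
        best[better] = cnt[better]
    return vals

rowwise_mode(a)  # array([8, 8, 4, 6])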

Warren Weckesser
  • Nice! Yes, much faster. The speed of this depends on the number of possible values in the rows. I'm using CIFAR-10, which contains only 10 possible values, so there are only 10 loops. It might not scale as well if there are as many distinct values as there are rows. – Phil Glau Jun 16 '16 at 02:10

A fully vectorized solution can be implemented using the numpy_indexed package (disclaimer: I am its author):

import numpy as np
import numpy_indexed as npi

# pair each element of 'a' with its row index
r = np.indices(a.shape)[0]
# count each unique (value, row index) pair
(ua, ur), c = npi.unique((a.flatten(), r.flatten()), return_count=True)
# within the group formed by each row index, find the pair with the highest count
u, i = npi.group_by(ur).argmax(c)
y = ua[i]

That is, we first count each unique (value, row index) pair in 'a', and then, within the groups formed by each row index, pick the value whose pair has the maximum count.
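
For reference, here is a rough pure-numpy sketch of the same sort-based idea, if you would rather avoid the extra dependency (it relies on np.unique's axis support, added in numpy 1.13; the intermediate names are mine):

import numpy as np

# pair each element with its row index and count the unique pairs
rows = np.repeat(np.arange(a.shape[0]), a.shape[1])
pairs = np.stack([rows, a.ravel()], axis=1)
uniq, counts = np.unique(pairs, axis=0, return_counts=True)

# sort by (row, count) so the most frequent value of each row comes last
order = np.lexsort((counts, uniq[:, 0]))
uniq = uniq[order]

# index of the last (most frequent) entry for each row
last = np.searchsorted(uniq[:, 0], np.arange(a.shape[0]), side='right') - 1
y = uniq[last, 1]  # array([8, 8, 4, 6])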

With only 10 possible values in 'a', I am not certain this is faster than the currently accepted answer, but the time complexity of this approach is not a function of the number of unique values in 'a', so it should scale better to datasets with a greater number of labels.

Eelco Hoogendoorn