
An extension to this question. In addition to having the unique elements row-wise, I want an array of the same shape that gives the count of each unique value. For example, if the initial array looks like this:

a = np.array([[1,  2, 2, 3,  4, 5],
              [1,  2, 3, 3,  4, 5],
              [1,  2, 3, 4,  4, 5],
              [1,  2, 3, 4,  5, 5],
              [1,  2, 3, 4,  5, 6]])

I would like to get this as the output from the function:

np.array([[1,  2, 0, 1,  1, 1],
          [1,  1, 2, 0,  1, 1],
          [1,  1, 1, 2,  0, 1],
          [1,  1, 1, 1,  2, 0],
          [1,  1, 1, 1,  1, 1]])

In numpy v1.9, np.unique seems to have an additional argument, return_counts, that returns the counts as a flat array. Is there some way these counts can be reconstructed into the original array's dimensions, with zeros where values were duplicated?
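For reference, here is a minimal sketch of what return_counts gives on its own. The small array is made up for illustration and is not the one above. Because np.unique flattens its input by default, the counts are pooled over the whole array rather than computed per row, which is the gap this question is about.

```python
import numpy as np

# Hypothetical 2x4 example, chosen so the rows share values.
a = np.array([[1, 2, 2, 3],
              [1, 3, 3, 3]])

# With no axis argument, np.unique works on the flattened array,
# so values from different rows are counted together.
values, counts = np.unique(a, return_counts=True)
print(values)  # [1 2 3]
print(counts)  # [2 2 4]
```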

JamCon
sriramn

2 Answers

2

The idea behind this answer is very similar to the one used here. I'm adding a unique imaginary number to each row. Therefore, no two numbers from different rows can be equal. Thus, you can find all the unique values in a 2D array per row with just one call to np.unique.

The index, ind, returned when return_index=True gives you the location of the first occurrence of each unique value.

The count, cnt, returned when return_counts=True gives you the count.

np.put(b, ind, cnt) places the count in the location of the first occurrence of each unique value.

One obvious limitation of the trick used here is that the original array must have int or float dtype. It cannot have a complex dtype to start with, since adding a unique imaginary number to each row may then produce equal values in different rows.
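As a rough illustration of that limitation, the sketch below applies the same kind of imaginary offset to a small complex array. The array is made up for the example. Each row has two distinct values, so four unique entries would be expected, but the offset moves an element of the second row onto a value already present in the first.

```python
import numpy as np

# Hypothetical complex input: row 0 already contains 1+1j.
a = np.array([[1 + 1j, 2],
              [1,      3]])

# Same style of per-row offset as in the answer: here 0j and 1j.
weight = 1j * np.linspace(0, a.shape[1], a.shape[0], endpoint=False)
b = a + weight[:, np.newaxis]

# Row 1's value 1 becomes 1+1j and collides with row 0's 1+1j.
u, cnt = np.unique(b, return_counts=True)
print(len(u))  # 3, not the 4 per-row unique values
print(cnt)     # [2 1 1]
```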


import numpy as np

a = np.array([[1,  2, 2, 3,  4, 5],
              [1,  2, 3, 3,  4, 5],
              [1,  2, 3, 4,  4, 5],
              [1,  2, 3, 4,  5, 5],
              [1,  2, 3, 4,  5, 6]])

def count_unique_by_row(a):
    weight = 1j*np.linspace(0, a.shape[1], a.shape[0], endpoint=False)
    b = a + weight[:, np.newaxis]
    u, ind, cnt = np.unique(b, return_index=True, return_counts=True)
    b = np.zeros_like(a)
    np.put(b, ind, cnt)
    return b

yields

In [79]: count_unique_by_row(a)
Out[79]: 
array([[1, 2, 0, 1, 1, 1],
       [1, 1, 2, 0, 1, 1],
       [1, 1, 1, 2, 0, 1],
       [1, 1, 1, 1, 2, 0],
       [1, 1, 1, 1, 1, 1]])
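To make the intermediate values easier to follow, here is a small walk-through on a made-up 2x3 array (not the one from the question). It only re-runs the steps of the function above one at a time. The comments rely on NumPy sorting complex numbers by real part first and then by imaginary part.

```python
import numpy as np

# Hypothetical tiny input with one duplicate per row.
a = np.array([[1, 1, 2],
              [1, 2, 2]])

# Per-row imaginary offsets: 0j for row 0 and 1.5j for row 1.
weight = 1j * np.linspace(0, a.shape[1], a.shape[0], endpoint=False)
b = a + weight[:, np.newaxis]

# ind holds flat indices of first occurrences, cnt the matching counts.
u, ind, cnt = np.unique(b, return_index=True, return_counts=True)
print(u)    # [1.+0.j  1.+1.5j 2.+0.j  2.+1.5j]
print(ind)  # [0 3 2 4]
print(cnt)  # [2 1 1 2]

# np.put writes cnt at those flat positions; everything else stays 0.
out = np.zeros_like(a)
np.put(out, ind, cnt)
print(out)  # [[2 0 1]
            #  [1 2 0]]
```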
unutbu
  • Very interesting. How exactly does `unravel_index` work? `ind` is a 1D array and this function seems to convert that to a 2-D index compatible with `a`... – sriramn Mar 01 '15 at 01:37
  • I recently wrote a little [explanation of `unravel_index`](http://stackoverflow.com/a/28745720/190597). See also [the docs](http://docs.scipy.org/doc/numpy/reference/generated/numpy.unravel_index.html). Actually, however, I just realised `unravel_index` is not needed here. `np.put` can place the `cnt` values into `b` using the `ind` flat indices directly. I'll edit the post to show what I mean. – unutbu Mar 01 '15 at 01:44
  • Your guess about what `unravel_index` is doing is exactly right. [The `ravel()` method](http://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html) returns a flattened 1D view of an N-dimensional array. The indices in `ind` are indices into this flattened 1D view of `a`. `unravel_index` converts these indices into the corresponding coordinate indices you would use to index into the N-dimensional array `a`. – unutbu Mar 01 '15 at 01:53
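A short sketch of the point made in these comments: np.unravel_index followed by fancy indexing and np.put with flat indices end up writing to the same cells. The shape, indices and values below are made up for illustration.

```python
import numpy as np

shape = (2, 3)
flat_idx = np.array([0, 2, 3, 4])  # positions in the flattened (C-order) array
vals = np.array([2, 1, 1, 2])

# Route 1: convert flat indices to (row, col) pairs, then index normally.
rows, cols = np.unravel_index(flat_idx, shape)
out1 = np.zeros(shape, dtype=int)
out1[rows, cols] = vals

# Route 2: let np.put treat the target as if it were flattened.
out2 = np.zeros(shape, dtype=int)
np.put(out2, flat_idx, vals)

print(rows, cols)                   # [0 0 1 1] [0 2 0 1]
print(np.array_equal(out1, out2))   # True
```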
1

This method does the same as np.unique for each row, by sorting each row and getting the length of each run of consecutive equal values. For an array with N rows and M columns this has complexity O(NM log(M)), which is better than running unique on the whole array, since that has complexity O(NM log(NM)).

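def row_unique_count(a):
    # A stable sort keeps equal values in their original order, so the
    # first element of each run is the first occurrence in the row.
    args = np.argsort(a, kind="stable")
    unique = a[np.indices(a.shape)[0], args]  # each row sorted
    # True wherever a new run of equal values starts (always at column 0).
    changes = np.pad(unique[:, 1:] != unique[:, :-1], ((0, 0), (1, 0)), mode="constant", constant_values=1)
    idxs = np.nonzero(changes)
    # Run length = distance to the next run start, or to the end of the row.
    tmp = np.hstack((idxs[-1], 0))
    counts = np.where(tmp[1:], np.diff(tmp), a.shape[-1]-tmp[:-1])
    count_array = np.zeros(a.shape, dtype="int")
    # Write each count at the original position of the run's first element.
    count_array[(idxs[0], args[idxs])] = counts
    return count_array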
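The same idea on a single row may be easier to read. This is only a sketch with a made-up row, and kind="stable" is an assumption added here so that the count lands on the first occurrence of each value.

```python
import numpy as np

row = np.array([3, 1, 3, 2, 3])

# Stable argsort: ties keep their original left-to-right order.
order = np.argsort(row, kind="stable")   # [1 3 0 2 4]
s = row[order]                           # [1 2 3 3 3]

# Positions in the sorted row where a new run of equal values begins.
starts = np.flatnonzero(np.r_[True, s[1:] != s[:-1]])   # [0 1 2]

# Each run ends where the next one starts (or at the end of the row).
lengths = np.diff(np.r_[starts, s.size])                # [1 1 3]

# Map run starts back to original positions and store the run lengths.
counts = np.zeros_like(row)
counts[order[starts]] = lengths
print(counts)  # [3 1 0 1 0]
```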

Running times:

In [162]: b = np.random.random(size=100000).reshape((100, 1000))

In [163]: %timeit row_unique_count(b)
100 loops, best of 3: 10.4 ms per loop

In [164]: %timeit count_unique_by_row(b)
100 loops, best of 3: 19.4 ms per loop

In [165]: assert np.all(row_unique_count(b) == count_unique_by_row(b))
kuppern87