
Does anyone know how to get the unique elements of each row of a matrix? For example, the input matrix may look like:

a = [[1,2,1,3,4,1,3],
     [5,5,3,1,5,1,2],
     [1,2,3,4,5,6,7],
     [9,3,8,2,9,8,4],
     [4,6,7,4,2,3,5]]

It should return the following:

b = rowWiseUnique(a)
=>  b = [[1,2,3,4,0,0,0],
         [5,3,1,2,0,0,0],
         [1,2,3,4,5,6,7],
         [9,3,8,2,4,0,0],
         [4,6,7,2,3,5,0]]

What is the most efficient way of doing this in numpy? I tried the following code; is there a better or shorter way of doing this?

import numpy as np

def uniqueRowElements(row):
    # pad the unique values of the row with zeros up to the original length
    length = row.shape[0]
    newRow = np.unique(row)
    zerosNumb = length - newRow.shape[0]
    zeros = np.zeros(zerosNumb)
    nR = np.concatenate((newRow, zeros), axis=0)
    return nR

b = list(map(uniqueRowElements, np.asarray(a)))
b = np.asarray(b)
print(b)
Lanc

5 Answers

Assuming the values in a are floats, you could use:

def using_complex(a):
    # give each row a distinct imaginary offset so that np.unique
    # only merges duplicates occurring within the same row
    weight = 1j*np.linspace(0, a.shape[1], a.shape[0], endpoint=False)
    b = a + weight[:, np.newaxis]
    u, ind = np.unique(b, return_index=True)
    # write the first occurrence of each value back into a zeroed array
    b = np.zeros_like(a)
    np.put(b, ind, a.flat[ind])
    return b

In [46]: using_complex(a)
Out[46]: 
array([[1, 2, 0, 3, 4, 0, 0],
       [5, 0, 3, 1, 0, 0, 2],
       [1, 2, 3, 4, 5, 6, 7],
       [9, 3, 8, 2, 0, 0, 4],
       [4, 6, 7, 0, 2, 3, 5]])

Note that using_complex does not return the unique values in the same order as rowWiseUnique; per the comments underneath the question, sorting the values is not required.


The most efficient method may depend on the number of rows in the array. Methods that use map or a for-loop to handle each row separately are good if the number of rows is not too large, but if there are lots of rows, you can do better by using a numpy trick to handle the entire array with one call to np.unique.

The trick is to add a unique imaginary number to each row. That way, when you call np.unique, the floats in the original array will be recognized as different values if they occur in different rows, but be treated as the same value if they occur in the same row.

This trick is implemented in the function using_complex, shown above (the full script appears at the end of this answer). Here is a benchmark comparing rowWiseUnique (the original method) with using_complex and solve:

In [87]: arr = np.random.randint(10, size=(100000, 10))

In [88]: %timeit rowWiseUnique(arr)
1 loops, best of 3: 1.34 s per loop

In [89]: %timeit solve(arr)
1 loops, best of 3: 1.78 s per loop

In [90]: %timeit using_complex(arr)
1 loops, best of 3: 206 ms per loop

import numpy as np

a = np.array([[1,2,1,3,4,1,3],
              [5,5,3,1,5,1,2],
              [1,2,3,4,5,6,7],
              [9,3,8,2,9,8,4],
              [4,6,7,4,2,3,5]])

def using_complex(a):
    weight = 1j*np.linspace(0, a.shape[1], a.shape[0], endpoint=False)
    b = a + weight[:, np.newaxis]
    u, ind = np.unique(b, return_index=True)
    b = np.zeros_like(a)
    np.put(b, ind, a.flat[ind])
    return b

def rowWiseUnique(a):
    b = np.asarray(list(map(uniqueRowElements, a)))
    return b

def uniqueRowElements(row):
    length = row.shape[0]
    newRow = np.unique(row)
    zerosNumb = length - newRow.shape[0]
    zeros = np.zeros(zerosNumb)
    nR = np.concatenate((newRow, zeros), axis=0)
    return nR

def solve(arr):
    n = arr.shape[1]
    new_arr = np.empty(arr.shape)  # NB: np.empty defaults to float64
    for i, row in enumerate(arr):
        new_row = np.unique(row)
        new_arr[i] = np.hstack((new_row, np.zeros(n - len(new_row))))
    return new_arr
unutbu
  • Interesting, seems optimized for smaller lengths in the second axis - from timings anyway. I'm going to have to noodle on this one. – wwii Nov 17 '14 at 04:44
  • @unutbu: Would it be also possible to generate an array of counts in the same shape as `a` with 0's in the positions where values were duplicated? – sriramn Mar 01 '15 at 00:56
  • Off-hand, I don't know. Above, I tried to reduce the problem to *one call* to np.unique by differentiating the rows with complex numbers. I'm not sure what the analogous trick would be for your problem. I'm not entirely clear about the details of your problem either -- if there are 0's where the values are duplicated, where are the counts? Would the other positions all be 1's? – unutbu Mar 01 '15 at 01:07
  • By the way, since the goal of this website is to build a searchable archive of good questions matched with good answers, it goes against this purpose to ask new questions in the comments. Please post a new question. I'd be happy to look at it, but I might not have an answer. – unutbu Mar 01 '15 at 01:10
  • I was looking at `np.unique` documentation and in ver 1.9 there is an additional argument `return_counts` that returns the counts of each unique value. So just for the first row in the example `[1,2,1,3,4,1,3]` I would like something like this: [3,1,0,2,1,0,0]. But I can't figure out how to reconstruct a 2D array from the flat array that is returned... – sriramn Mar 01 '15 at 01:11
  • Got it - please see here: http://stackoverflow.com/questions/28789014/count-unique-elements-row-wise-in-an-ndarray – sriramn Mar 01 '15 at 01:20
  • Note that the proposed solution is in a worse runtime class than the OP's initial solution. Since np.unique uses sorting, I assume it runs in O(n * log(n)). It follows that the proposed solution runs in O(n * m * log(n * m)), whereas the OP's solution runs in O(n * m * log(m)). Here I suppose the original array has shape (n, m). Therefore, a solution using map might be better for large matrices. – Samufi Feb 25 '17 at 20:54

The fastest way should be to set all duplicates to zero using sort and diff:

def row_unique(a):
    unique = np.sort(a)
    # after sorting each row, duplicates sit next to each other
    duplicates = unique[:, 1:] == unique[:, :-1]
    unique[:, 1:][duplicates] = 0
    return unique

This is about 3 times as fast as unutbu's solution on my computer:

In [26]: a = np.random.randint(1, 101, size=100000).reshape(1000, 100)

In [27]: %timeit row_unique(a)
100 loops, best of 3: 3.18 ms per loop

In [28]: %timeit using_complex(a)
100 loops, best of 3: 15.4 ms per loop

In [29]: assert np.all(np.sort(using_complex(a)) == np.sort(row_unique(a)))

In order to return the counts of each unique element, one could also do:

def row_unique(a, return_counts=False):
    unique = np.sort(a)
    duplicates = unique[:, 1:] == unique[:, :-1]
    unique[:, 1:][duplicates] = 0
    if not return_counts:
        return unique
    # the gap between consecutive nonzero entries in the flattened result
    # equals the multiplicity of the preceding unique value
    count_matrix = np.zeros(a.size, dtype="int")
    idxs = np.flatnonzero(unique)
    counts = np.diff(idxs)
    count_matrix[idxs[:-1]] = counts
    count_matrix[idxs[-1]] = a.size - idxs[-1]
    return unique, count_matrix.reshape(a.shape)


kuppern87

You can do something like this:

def solve(arr):
    n = arr.shape[1]
    new_arr = np.empty(arr.shape)  # NB: np.empty defaults to float64, so the result is a float array
    for i, row in enumerate(arr):
        new_row = np.unique(row)
        new_arr[i] = np.hstack((new_row, np.zeros(n - len(new_row))))
    return new_arr

This is around 4x faster than the OP's current code for a 1000 x 1000 array:

>>> arr = np.arange(1000000).reshape(1000, 1000)
>>> %timeit b = map(uniqueRowElements, arr); b = np.asarray(b)
10 loops, best of 3: 71.2 ms per loop
>>> %timeit solve(arr)
100 loops, best of 3: 16.6 ms per loop
Ashwini Chaudhary
  • Thanks @Ashwini. If I am not wrong, your code gains by preallocating the output array, while in my code a lot of time is spent adjusting the output array size? – Lanc Nov 16 '14 at 16:34
  • @Lanc Yes, exactly. `np.empty` takes only around 400 ns for me. In your code the conversion of `b` to a NumPy array is the slower part: `map(uniqueRowElements, arr)` takes around 14 ms on my system, and the rest is taken by `np.asarray`. – Ashwini Chaudhary Nov 16 '14 at 16:50
  • Try ```np.apply_along_axis(uniqueRowElements, 1, a)```. – wwii Nov 16 '14 at 16:52
  • @wwii That took around 19.5 ms. – Ashwini Chaudhary Nov 16 '14 at 16:53
  • I can't reproduce the speed gain: I get 96.3 ms for the OP's and 86.2 ms for yours. – sebix Nov 16 '14 at 16:54
  • @sebix I get timing (ratios) similar to @Ashwini with ```timeit.Timer``` – wwii Nov 16 '14 at 16:58

A variation on the OP's solution with a slight improvement: ~3% faster when using numpy.apply_along_axis with large (1000x1000) arrays, but still a bit slower than @Ashwini's solution.

def foo(row):
    b = np.zeros(row.shape)
    u = np.unique(row)
    b[:u.shape[0]] = u  # unique values first, zeros pad the rest
    return b

b = np.apply_along_axis(foo, 1, a)

Timing ratios seem to be a bit closer using an array with duplicates in the rows, e.g. `a = np.random.random_integers(0, 500, (1000*1000)).reshape(1000, 1000)`.

wwii

It's not very efficient, because moving all the zeros to the end of a row can't be done very efficiently.

import numpy as np

a = np.array([[1,2,1,3,4,1,3],
              [5,5,3,1,5,1,2],
              [1,2,3,4,5,6,7],
              [9,3,8,2,9,8,4],
              [4,6,7,4,2,3,5]])

row_len = len(a[0])

for r in range(len(a)):
    found = set()
    for i in range(row_len):
        if a[r][i] not in found:
            found.add(a[r][i])
        else:
            a[r][i] = 0  # zero out repeated values
    # sort ascending, then reverse so the zeros end up last
    a[r].sort()
    a[r] = a[r][::-1]

print(a)

Output:

[[4 3 2 1 0 0 0]
 [5 3 2 1 0 0 0]
 [7 6 5 4 3 2 1]
 [9 8 4 3 2 0 0]
 [7 6 5 4 3 2 0]]
Eli Korvigo
  • Using for loops is not recommended; if I have a huge matrix, the above code is going to be very slow. I tried something in the code above; have a look and please suggest if there is a better way of doing this. – Lanc Nov 16 '14 at 16:00
  • @Lanc both numpy.unique and map() use 'for'. I don't think there is a way to accomplish your task without 'for'. – Eli Korvigo Nov 16 '14 at 16:18