Indexing the unique rows of an array

Question

I would like to get the indices of the unique rows in an array. A unique row should have its own index (starting with zero). Here is an example:

import numpy as np

a = np.array([[ 0.,  1.],
              [ 0.,  2.],
              [ 0.,  3.],
              [ 0.,  1.],
              [ 0.,  2.],
              [ 0.,  3.],
              [ 0.,  1.],
              [ 0.,  2.],
              [ 0.,  3.],
              [ 1.,  1.],
              [ 1.,  2.],
              [ 1.,  3.],
              [ 1.,  1.],
              [ 1.,  2.],
              [ 1.,  3.],
              [ 1.,  1.],
              [ 1.,  2.],
              [ 1.,  3.]])

In the above array there are six unique rows:

import pandas as pd
b = pd.DataFrame(a).drop_duplicates().values

    array([[ 0.,  1.],
           [ 0.,  2.],
           [ 0.,  3.],
           [ 1.,  1.],
           [ 1.,  2.],
           [ 1.,  3.]])

Each unique row is assigned an index (0, 1, 2, 3, 4, 5). Mapping every row of array a to the index of its unique row should give:

[0, 1, 2, 0, 1, 2, 0, 1, 2, 3, 4, 5, 3, 4, 5, 3, 4, 5]

How can I get to this result in an efficient way?

`pd.DataFrame(a).drop_duplicates().index` will return you an index of your unique rows in the original NP array - is that what you want? — MaxU - stand with Ukraine, Mar 27 '16 at 17:02
No, this is not what I want. This returns the position where the unique rows first appear. — blaz, Mar 27 '16 at 17:09
You seem to be asking for a multi-column `factorize`: see this question and answer http://stackoverflow.com/questions/16453465/multi-column-factorize-in-pandas — Alex Riley, Mar 27 '16 at 17:27

score 3 · Accepted Answer · answered Mar 27 '16 at 17:35

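A pure NumPy solution:

av = a.view(np.complex128).ravel()  # pack each (x, y) float64 pair into one complex number
_, inv = np.unique(av, return_inverse=True)

Then inv is:

array([0, 1, 2, 0, 1, 2, 0, 1, 2, 3, 4, 5, 3, 4, 5, 3, 4, 5], dtype=int64)

`np.complex128` packs the two float64 components of each row into a single value, and sorting complex numbers (real part first, then imaginary part) preserves the row order. For other dtypes, other approaches are possible.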
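
For rows that are not pairs of float64 values, one such approach is `np.unique` with `axis=0` (available in NumPy 1.13 and later), which compares whole rows. This is a minimal sketch, assuming a 2-D array; the names `uniq` and `inv` are only illustrative. The labels follow the sorted order of the unique rows, which happens to coincide with the order of first appearance for the example array:

```python
import numpy as np

# Same data as in the question, built more compactly.
a = np.array([[0., 1.], [0., 2.], [0., 3.]] * 3 +
             [[1., 1.], [1., 2.], [1., 3.]] * 3)

# axis=0 makes np.unique compare whole rows instead of single elements.
uniq, inv = np.unique(a, axis=0, return_inverse=True)

# Flatten defensively: the shape of the inverse array has varied
# between NumPy releases when an axis is given.
inv = inv.reshape(-1)

print(inv.tolist())
```

If the labels must follow first-appearance order for arbitrary data, the sorted labels would need an extra remapping step, which is not shown here.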

score 0 · Answer 2 · answered Mar 27 '16 at 17:24

Solution without numpy and pandas:

a = [[0, 1],
     [0, 2],
     [0, 3],
     [0, 1],
     [0, 2],
     [0, 3],
     [0, 1],
     [0, 2],
     [0, 3],
     [1, 1],
     [1, 2],
     [1, 3],
     [1, 1],
     [1, 2],
     [1, 3],
     [1, 1],
     [1, 2],
     [1, 3]]

b = []

#= ALGORITHM

point = -1                                               # Increment
cache = [[-1 for x in range(1000)] for x in range(1000)] # Change to dynamic

for i in a:
    x = i[0]; y = i[1]

    # Check what's going on here...
    # print("x: {0} y: {1} --> {2} (cache)".format(x, y, cache[x][y]))

    if cache[x][y] == -1:
        point += 1
        cache[x][y] = point
        b.append(point)
    else:
        b.append(cache[x][y])

#= TESTING

print(b) # [0, 1, 2, 0, 1, 2, 0, 1, 2, 3, 4, 5, 3, 4, 5, 3, 4, 5]
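
The fixed 1000 x 1000 cache above only handles non-negative integer coordinates below 1000. One way to make it dynamic, sketched here with a plain dictionary keyed by the row as a tuple (the function name `index_unique_rows` is hypothetical), should work for any number of columns and any hashable values:

```python
def index_unique_rows(rows):
    """Label each row with the index given to its first occurrence."""
    seen = {}    # maps row tuple -> index assigned at first appearance
    labels = []
    for row in rows:
        key = tuple(row)
        # setdefault stores len(seen) only when the key is new, so
        # indices are handed out as 0, 1, 2, ... in order of appearance.
        labels.append(seen.setdefault(key, len(seen)))
    return labels


rows = [[0, 1], [0, 2], [0, 3]] * 3 + [[1, 1], [1, 2], [1, 3]] * 3
print(index_unique_rows(rows))
```

Memory use then grows with the number of distinct rows rather than with the range of the values.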

score 0 · Answer 3 · edited Mar 27 '16 at 17:36

This is what I got:

b = pd.DataFrame(a).drop_duplicates()
indexed_rows = np.zeros(a.shape[0], dtype=int)
for index, i in enumerate(a):
    for unique_row, j in enumerate(b.values):
        if np.all(i==j):
            indexed_rows[index] = unique_row

The returned result is:

array([0, 1, 2, 0, 1, 2, 0, 1, 2, 3, 4, 5, 3, 4, 5, 3, 4, 5])

edited Mar 27 '16 at 17:36 by Maciej A. Czyzewski · answered Mar 27 '16 at 17:25 by blaz

It's not an effective way... The `b` variable is not defined (`b = pd.DataFrame(a).drop_duplicates()`) – Maciej A. Czyzewski Mar 27 '16 at 17:33
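
The nested loop can be avoided by letting pandas number the groups of identical rows. This is a minimal sketch, assuming pandas 0.20.2 or later for `GroupBy.ngroup`; with `sort=False` the group numbers are expected to follow the order in which rows first appear:

```python
import numpy as np
import pandas as pd

# Same data as in the question, built more compactly.
a = np.array([[0., 1.], [0., 2.], [0., 3.]] * 3 +
             [[1., 1.], [1., 2.], [1., 3.]] * 3)

df = pd.DataFrame(a)

# Group on every column so that identical rows fall into the same group,
# then number the groups starting from 0.
indexed_rows = df.groupby(list(df.columns), sort=False).ngroup().values

print(indexed_rows.tolist())
```

This replaces the row-by-row comparison against every unique row with a single grouped operation.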
