Remove duplicate values from numpy structured array

Question

I have a structured array v such as

import numpy as np
v = np.zeros((3,3), [('a1', int), ('a2', int), ('a3', int),
    ('a4', int), ('a5', int), ('a6', int)])

Usually v would be much larger, with the 'a1', ..., 'a6' values computed by other routines. Let's say that v is

>>> print(v)
    [[(2, 0, 0, 0, 0, 1) (1, 0, 3, 2, 1, 2) (3, 1, 3, 0, 3, 1)]
     [(1, 2, 1, 1, 0, 3) (3, 0, 3, 2, 3, 1) (1, 3, 1, 1, 3, 3)]
     [(0, 2, 3, 3, 1, 1) (0, 1, 1, 1, 3, 0) (0, 3, 3, 3, 1, 0)]]

I need to remove duplicates from each entry, and (optionally) sort each of them, so that, after operating on v, I have another array that looks like

[[(0, 1, 2) (0, 1, 2, 3) (0, 1, 3)]
 [(0, 1, 2, 3) (0, 1, 2, 3) (1, 3)]
 [(0, 1, 2, 3) (0, 1, 3) (0, 1, 3)]]

My hunch would be numpy.unique, but I can't make it work. Any ideas?

Something along the line of [this answer](http://stackoverflow.com/a/32381082/3962537)? — Dan Mašek, Apr 23 '16 at 01:08
not completely numpy, but `names = v.dtype.names` followed by `[np.unique(v[i]) for i in v.dtype.names]` will give you a list of arrays; or, to combine them into an array of dtype=object, `w = np.array([np.unique(v[i]).tolist() for i in v.dtype.names])`, which gives `array([[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 3], [0, 1, 2, 3], [0, 1, 3], [0, 1, 2, 3]], dtype=object)` — , Apr 23 '16 at 01:22

score 1 · Answer 1 · answered Apr 23 '16 at 01:18

What about something like:

v = np.array(
    [[(2, 0, 0, 0, 0, 1), (1, 0, 3, 2, 1, 2), (3, 1, 3, 0, 3, 1)],
     [(1, 2, 1, 1, 0, 3), (3, 0, 3, 2, 3, 1), (1, 3, 1, 1, 3, 3)],
     [(0, 2, 3, 3, 1, 1), (0, 1, 1, 1, 3, 0), (0, 3, 3, 3, 1, 0)]])


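def uniqueify(obj):
    # Recurse until obj is 1-D, then deduplicate (np.unique also sorts).
    if isinstance(obj[0], np.ndarray):
        # dtype=object because the results are jagged; recent NumPy
        # versions refuse to pack jagged sequences into a regular array.
        return np.array([uniqueify(e) for e in obj], dtype=object)
    else:
        return np.unique(obj)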


v2 = uniqueify(v)
print(v2)

Output:

[[array([0, 1, 2]) array([0, 1, 2, 3]) array([0, 1, 3])]
 [array([0, 1, 2, 3]) array([0, 1, 2, 3]) array([1, 3])]
 [array([0, 1, 2, 3]) array([0, 1, 3]) array([0, 1, 3])]]

Note: jagged arrays can be weird. You may be just as well off building plain Python lists (of lists) of arrays instead, for example:

def uniqueify(obj):
    if isinstance(obj[0], np.ndarray):
        return [uniqueify(e) for e in obj]
    else:
        return np.unique(obj)

This produces essentially the same thing, but with Python lists containing the NumPy arrays:

[[array([0, 1, 2]), array([0, 1, 2, 3]), array([0, 1, 3])], [array([0, 1, 2, 3]), array([0, 1, 2, 3]), array([1, 3])], [array([0, 1, 2, 3]), array([0, 1, 3]), array([0, 1, 3])]]

Or with manual formatting:

[[array([0, 1, 2]), array([0, 1, 2, 3]), array([0, 1, 3])], 
 [array([0, 1, 2, 3]), array([0, 1, 2, 3]), array([1, 3])], 
 [array([0, 1, 2, 3]), array([0, 1, 3]), array([0, 1, 3])]]
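If the jaggedness gets in the way, one alternative is to keep a rectangular array and mask the duplicates instead of dropping them. The following is only a minimal sketch of that idea, not part of the code above: it sorts along the last axis and masks every value that equals its left neighbour. The names `s`, `dup`, `padded` and `counts` are made up for illustration.

```python
import numpy as np

v = np.array(
    [[(2, 0, 0, 0, 0, 1), (1, 0, 3, 2, 1, 2), (3, 1, 3, 0, 3, 1)],
     [(1, 2, 1, 1, 0, 3), (3, 0, 3, 2, 3, 1), (1, 3, 1, 1, 3, 3)],
     [(0, 2, 3, 3, 1, 1), (0, 1, 1, 1, 3, 0), (0, 3, 3, 3, 1, 0)]])

# Sort each cell, then flag every value equal to the one before it.
s = np.sort(v, axis=-1)
dup = np.zeros(s.shape, dtype=bool)
dup[..., 1:] = s[..., 1:] == s[..., :-1]

# Rectangular (3, 3, 6) masked array; duplicates are masked, not removed.
padded = np.ma.masked_array(s, mask=dup)

# Number of distinct values in each cell, as a plain (3, 3) array.
counts = padded.count(axis=-1)

# The unique values of a single cell can still be pulled out when needed.
print(padded[1, 2].compressed())
print(counts)
```

The masked positions are interleaved with the kept values rather than pushed to the end, so use `compressed()` on a cell when you want the compact form.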

I agree, jagged arrays are weird. One way to overcome that would be to pad them out with "NA" values using masked arrays. — John Zwinck, Apr 23 '16 at 01:30
Your answer worked after I replaced my original `v` definition with the following: `viz1 = np.zeros((L,L), dtype='(1,6)int8' )`. Then I get the same `v2` as you've got. Thanks for that. I would also like to get another array, the elements of which are the number of elements in each of the elements of `v2` (if I made myself clear...)? — Luiz Eleno, Apr 23 '16 at 10:15

score 0 · Accepted Answer · edited May 23 '17 at 11:45

This use of set works:

In [111]: np.array([tuple(set(i)) for i in v.ravel().tolist()]).reshape(3,3)
Out[111]: 
array([[(0, 1, 2), (0, 1, 2, 3), (0, 1, 3)],
       [(0, 1, 2, 3), (0, 1, 2, 3), (1, 3)],
       [(0, 1, 2, 3), (0, 1, 3), (0, 1, 3)]], dtype=object)

I've returned a 2d array of tuples (dtype object). I did not preserve the structured array dtypes. I could just as well have returned an array of sets, or a list of sets.
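Two caveats, as far as I can tell: a set has no guaranteed order, and recent NumPy versions refuse to build an array from tuples of different lengths unless the object dtype is requested. A minimal sketch that sidesteps both by sorting explicitly and filling a preallocated object array one cell at a time (the names `rows`, `cells` and `out` are only for illustration):

```python
import numpy as np

names = ['a1', 'a2', 'a3', 'a4', 'a5', 'a6']
rows = [[(2, 0, 0, 0, 0, 1), (1, 0, 3, 2, 1, 2), (3, 1, 3, 0, 3, 1)],
        [(1, 2, 1, 1, 0, 3), (3, 0, 3, 2, 3, 1), (1, 3, 1, 1, 3, 3)],
        [(0, 2, 3, 3, 1, 1), (0, 1, 1, 1, 3, 0), (0, 3, 3, 3, 1, 0)]]
v = np.array(rows, dtype=[(n, int) for n in names])  # structured, shape (3, 3)

# Deduplicate each record and sort explicitly instead of relying on set order.
cells = [tuple(sorted(set(rec))) for rec in v.ravel().tolist()]

# Fill an object array element by element so NumPy never tries to
# interpret the variable-length tuples as an extra dimension.
out = np.empty(len(cells), dtype=object)
for k, cell in enumerate(cells):
    out[k] = cell
out = out.reshape(v.shape)

print(out)
```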

Or, with tolist, a nested list of tuples:

In [112]: _.tolist()
Out[112]: 
[[(0, 1, 2), (0, 1, 2, 3), (0, 1, 3)],
 [(0, 1, 2, 3), (0, 1, 2, 3), (1, 3)],
 [(0, 1, 2, 3), (0, 1, 3), (0, 1, 3)]]

I don't need the original tolist; iteration on the raveled array is enough:

In [115]: [set(i) for i in v.ravel()]
Out[115]: 
[{0, 1, 2},
 {0, 1, 2, 3},
 {0, 1, 3},
 {0, 1, 2, 3},
 {0, 1, 2, 3},
 {1, 3},
 {0, 1, 2, 3},
 {0, 1, 3},
 {0, 1, 3}]

unique gives the same thing. I can't call np.unique(i) directly, since that treats each record as a whole, i.e. as a 1 element structured array:

In [117]: [np.unique(i.tolist()) for i in v.ravel()]
Out[117]: 
[array([0, 1, 2]),
 array([0, 1, 2, 3]),
 array([0, 1, 3]),
 array([0, 1, 2, 3]),
 array([0, 1, 2, 3]),
 array([1, 3]),
 array([0, 1, 2, 3]),
 array([0, 1, 3]),
 array([0, 1, 3])]
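To make the difference concrete, here is a small sketch comparing the two calls on a single record. It assumes `v` is built with six integer fields as in the question; `whole` and `fields` are illustrative names.

```python
import numpy as np

names = ['a1', 'a2', 'a3', 'a4', 'a5', 'a6']
rows = [[(2, 0, 0, 0, 0, 1), (1, 0, 3, 2, 1, 2), (3, 1, 3, 0, 3, 1)],
        [(1, 2, 1, 1, 0, 3), (3, 0, 3, 2, 3, 1), (1, 3, 1, 1, 3, 3)],
        [(0, 2, 3, 3, 1, 1), (0, 1, 1, 1, 3, 0), (0, 3, 3, 3, 1, 0)]]
v = np.array(rows, dtype=[(n, int) for n in names])

rec = v[0, 0]                      # one record (a numpy.void scalar)

# np.unique sees the record as ONE element of the structured dtype,
# so there is nothing to deduplicate.
whole = np.unique(rec)

# Converting to a plain tuple first exposes the six field values.
fields = np.unique(rec.tolist())

print(whole.shape, whole.dtype.names)
print(fields)
```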

=======================

This view converts it to a 3d array (it assumes the six fields are 4-byte ints, matching 'i4'):

In [134]: v1=v.view(np.dtype('(6,)i4'))

In [135]: v1
Out[135]: 
array([[[2, 0, 0, 0, 0, 1],
        [1, 0, 3, 2, 1, 2],
        [3, 1, 3, 0, 3, 1]],

       [[1, 2, 1, 1, 0, 3],
        [3, 0, 3, 2, 3, 1],
        [1, 3, 1, 1, 3, 3]],

       [[0, 2, 3, 3, 1, 1],
        [0, 1, 1, 1, 3, 0],
        [0, 3, 3, 3, 1, 0]]])

I'm not sure this helps, though. Applying unique to the last dimension has the same issues as with the structured form.

In [137]: [np.unique(i) for i in v1.reshape(-1,6)]

=====================

What I wrote below is for a 1d structured array. The example is 2d. Of course it could be flattened, and then all of that applies.

My first thought was to transform this to a list and apply set to each tuple. It's a structured array, so v.tolist() will be a list of tuples.

Something along that line was my first suggestion in the link that Dan found:

https://stackoverflow.com/a/32381082/901925

(The focus there is on the count; the bincount solutions won't help here.)

 [set(i) for i in v.tolist()]

You may not even need to translate it, though I'd have to test it. I don't know offhand if a structured record will work as an argument to set.

 [set(i) for i in v]

Regardless, the result will be a list of items of different lengths. Whether they are sets, lists or arrays isn't important. They just won't be structured arrays, unless we take the extra effort to identify which fields are unique.

Since the fields are all the same dtype, it would be easy to convert this to a 2d array.

 v.view(int, 6)  # 6 fields

should do the trick (needs testing). (Correction: converting this to a pure int array isn't as easy as I thought.)
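On NumPy versions that ship `numpy.lib.recfunctions.structured_to_unstructured` (1.16 and later, if I recall correctly), the conversion can be done without hard-coding the field size. This is a sketch of that swapped-in approach, not the view trick above; `v3` and `uniq` are illustrative names.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

names = ['a1', 'a2', 'a3', 'a4', 'a5', 'a6']
rows = [[(2, 0, 0, 0, 0, 1), (1, 0, 3, 2, 1, 2), (3, 1, 3, 0, 3, 1)],
        [(1, 2, 1, 1, 0, 3), (3, 0, 3, 2, 3, 1), (1, 3, 1, 1, 3, 3)],
        [(0, 2, 3, 3, 1, 1), (0, 1, 1, 1, 3, 0), (0, 3, 3, 3, 1, 0)]]
v = np.array(rows, dtype=[(n, int) for n in names])

# The fields become a new trailing axis: (3, 3) structured -> (3, 3, 6) int.
v3 = rfn.structured_to_unstructured(v)

# np.unique still has to run cell by cell, because the results are jagged.
uniq = [np.unique(cell) for cell in v3.reshape(-1, v3.shape[-1])]

print(v3.shape)
print(uniq)
```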

np.unique should work as well as set; however I suspect set is faster for 6 values (or any other reasonable number of fields).
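The speed guess is easy to check on your own machine. A rough timing sketch follows; I'm not claiming any particular numbers, and the record `rec` and the repeat count are arbitrary choices.

```python
import timeit
import numpy as np

rec = (1, 0, 3, 2, 1, 2)   # one record's worth of field values

# Both approaches return the sorted distinct values of the record.
via_set = sorted(set(rec))
via_unique = np.unique(rec).tolist()

# Time each approach over the same number of repetitions.
n = 10000
t_set = timeit.timeit(lambda: sorted(set(rec)), number=n)
t_unique = timeit.timeit(lambda: np.unique(rec), number=n)

print("set:    %.4f s for %d calls" % (t_set, n))
print("unique: %.4f s for %d calls" % (t_unique, n))
```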
